Abstract

This survey article discusses the main concepts and techniques of Stein's method 
for distributional approximation by the normal, Poisson, exponential, and geometric 
distributions, and also its relation to concentration inequalities. The material is pre- 
sented at a level accessible to beginning graduate students studying probability with 
the main emphasis on the themes that are common to these topics and also to much 
of the Stein's method literature. 
1 Introduction

The fundamental example of the type of result we will deal with in this article is the following 
version of the classical Berry-Esseen bound for the central limit theorem. 

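Theorem 1.1. [13] Let $X_1, X_2, \ldots$ be i.i.d. random variables with $\mathbb{E}|X_1|^3 < \infty$, $\mathbb{E}[X_1] = 0$,
and $\mathrm{Var}(X_1) = 1$. If $\Phi$ denotes the c.d.f. of a standard normal distribution and
$W_n = \sum_{i=1}^n X_i/\sqrt{n}$, then

$$|\mathbb{P}(W_n \le x) - \Phi(x)| \le 1.88\, \frac{\mathbb{E}|X_1|^3}{\sqrt{n}}.$$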
The theorem quantifies the error in the central limit theorem and has many related
embellishments such as assuming independent, but not identically distributed variables, or
allowing a specified dependence structure. The proofs of such results typically rely on
characteristic function (Fourier) analysis whereby showing convergence is significantly easier than
obtaining error bounds.
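
To make the statement of Theorem 1.1 concrete, the following sketch computes the left side exactly in one special case and compares it with the right side. The choice of symmetric $\pm 1$ (Rademacher) summands, for which $\mathbb{E}|X_1|^3 = 1$, and all function names are assumptions made for illustration; they do not come from the references.

```python
import math

def std_normal_cdf(x):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance_rademacher(n):
    """Exact sup_x |P(W_n <= x) - Phi(x)| for W_n = (X_1 + ... + X_n)/sqrt(n),
    where the X_i are i.i.d. with P(X_i = 1) = P(X_i = -1) = 1/2."""
    total = 2 ** n
    cdf_left = 0.0   # P(W_n < atom), i.e. the c.d.f. just before the current atom
    worst = 0.0
    for k in range(n + 1):                      # k = number of +1 summands
        atom = (2 * k - n) / math.sqrt(n)
        cdf_right = cdf_left + math.comb(n, k) / total
        phi = std_normal_cdf(atom)
        # Between atoms the c.d.f. of W_n is flat and Phi is increasing, so the
        # supremum is attained at an atom, approached from the left or the right.
        worst = max(worst, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return worst

def berry_esseen_bound(n, third_abs_moment=1.0, constant=1.88):
    """Right side of Theorem 1.1; E|X_1|^3 = 1 for Rademacher variables."""
    return constant * third_abs_moment / math.sqrt(n)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, kolmogorov_distance_rademacher(n), berry_esseen_bound(n))
```

In this example both columns decay at the rate $n^{-1/2}$, so only the constant in the bound leaves room for improvement.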

More generally, a central theme of probability theory is proving distributional limit theo- 
rems, and for the purpose of approximation it is of interest to quantify the rate of convergence 
in such results. However, many of the methods commonly employed to show distributional
convergence (e.g. Fourier analysis and the method of moments) yield an error
rate only after serious added effort, if at all. Stein's method is a technique that can quantify the error in
the approximation of one distribution by another in a variety of metrics; note that this last 
remark has a wider scope than the discussion above. 

Stein's method was initially conceived by Charles Stein in the seminal paper [51] to 
provide errors in the approximation by the normal distribution of the distribution of the 
sum of dependent random variables of a certain structure. However, the ideas presented 
in [51] are sufficiently abstract and powerful to be able to work well beyond that intended 
purpose, applying to approximation of more general random variables by distributions other 
than the normal (such as the Poisson, exponential, etc). 

Broadly speaking, Stein's method has two components: the first is a framework to convert
the problem of bounding the error in the approximation of one distribution of interest by 
another, well understood distribution (e.g. the normal) into a problem of bounding the 
expectation of a certain functional of the random variable of interest (see (2.5) for the 
normal distribution and (4.4) for the Poisson). The second component of Stein's method consists of
techniques to bound the expectation appearing in the first component; Stein appropriately
refers to this step as "auxiliary randomization." With this in mind, it is no surprise that 
Stein's monograph [52], which reformulates the method in a more coherent form than [51], 
is titled "Approximate Computation of Expectations." 

There are now hundreds of papers expanding and applying the basic framework above.
For the first component, converting to a problem of bounding a certain expectation involving 
the distribution of interest has been achieved for many well-known distributions. Moreover, 
canonical methods have been established for achieving this conversion for new distributions 
[22, 43] (although by no means is this process easy or guaranteed to be fruitful). 

For the second component, there is now an array of coupling techniques available to bound 
these functionals for various distributions. Moreover, these coupling techniques can be used 
in other types of problems which can be distilled into bounding expectations of a function 
of a distribution of interest. Two examples of the types of problems where this program has 
succeeded are concentration inequalities [18, 28, 29] (using the well known Proposition 7.1 
below), and local limit theorems [47]. We will cover the former example in this article. 

The purpose of this document is to attempt to elucidate the workings of these two 
components at a basic level in order to help make Stein's method more accessible to the 
uninitiated. There are numerous other introductions to Stein's method which this document
draws from, mainly [23, 24] for normal approximation, [11, 20] for Poisson approximation,
and an amalgamation of related topics in the collections [10, 26]. Most of these references 
focus on one distribution or variation of Stein's method in order to achieve depth, so there 
are themes and ideas that appear throughout the method which can be difficult to glean 
from these references. We hope to capture these fundamental concepts in uniform language 
to give easier entrance to the vast literature on Stein's method and applications. A similar 
undertaking but with smaller scope can be found in Chapter 2 of [48], which also serves as 
a nice introduction to the basics of Stein's method. 

Of course the purpose of Stein's method is to prove approximation results, so we will 
illustrate concepts in examples and applications, many of which are combinatorial in nature. 
In order to facilitate exposition, we will typically work out examples and applications only 
in the most straightforward way and provide pointers to the literature where variations of 
the arguments produce more thorough results. 

The layout of this document is as follows. In Section 2, we discuss the basic framework 
of the first component above in the context of Stein's method for normal approximation,
since this setting is the most studied and contains many of the concepts we will need later. 
In Section 3 we discuss the commonly employed couplings used in normal approximation 
to achieve the second component above. We follow the paradigm of these two sections in 
discussing Stein's method for Poisson approximation in Section 4, exponential approximation 
in Section 5, and geometric approximation in Section 6. In the final Section 7 we discuss how 
to use some of the coupling constructions of Section 3 to prove concentration inequalities. 

We conclude this section with a discussion of necessary background and notation. 

1.1 Background and notation 

This is a document based on a graduate course given at U.C. Berkeley in the Spring semester
of 2011 and is aimed at an audience having seen probability theory at the level of [34]. That 
is, we do not rely heavily on measure theoretic concepts, but exposure at a heuristic level to 
concepts such as sigma-fields will be useful. Also, basic Markov chain theory concepts such 
as reversibility are assumed along with the notion of coupling random variables which will 
be used frequently in the sequel. 

Many of our applications will concern various statistics of Erdős-Rényi random graphs.
We say $G = G(n,p)$ is an Erdős-Rényi random graph on $n$ vertices with edge probability $p$
if for each of the $\binom{n}{2}$ pairs of vertices, there is an edge connecting the vertices with probability $p$
(and no edge connecting them with probability $1 - p$), independent of all other connections
between other pairs of vertices. These objects are a simple and classical model of networks
that are well studied; see [14, 36] for book length treatments.

For a set $A$, we write $\mathbb{I}[\cdot \in A]$ to denote the function which is one on $A$ and zero otherwise. We
write $g(n) \asymp f(n)$ if $g(n)/f(n)$ tends to a positive constant as $n \to \infty$, and $g(n) = O(f(n))$
if $g(n)/f(n)$ is bounded as $n \to \infty$.

Since Stein's method is mainly concerned with bounding the distance between probability
distributions in a given metric, we now discuss the metrics we will use.



1.1.1 Probability Metrics 

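For two probability measures $\mu$ and $\nu$, the probability metrics we will use have the form

$$d_{\mathcal{H}}(\mu, \nu) = \sup_{h \in \mathcal{H}} \left| \int h(x)\, d\mu(x) - \int h(x)\, d\nu(x) \right|, \tag{1.1}$$

where $\mathcal{H}$ is some family of "test" functions. For random variables $X$ and $Y$ with respective
laws $\mu$ and $\nu$, we will abuse notation and write $d_{\mathcal{H}}(X, Y)$ in place of $d_{\mathcal{H}}(\mu, \nu)$.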

We now detail examples of metrics of this form along with some useful properties and 
relations. 

1. By taking $\mathcal{H} = \{\mathbb{I}[\cdot \le x] : x \in \mathbb{R}\}$ in (1.1), we obtain the Kolmogorov metric, which
we denote $d_K$. The Kolmogorov metric is the maximum distance between distribution
functions, so a sequence of distributions converging to a fixed distribution in this metric
implies weak convergence.

2. By taking $\mathcal{H} = \{h : \mathbb{R} \to \mathbb{R} : |h(x) - h(y)| \le |x - y|\}$ in (1.1), we obtain the Wasserstein
metric, which we denote $d_W$. The Wasserstein metric is a common metric occurring
in many contexts and will be the main metric we use for approximation by continuous
distributions.

3. By taking $\mathcal{H} = \{\mathbb{I}[\cdot \in A] : A \in \mathrm{Borel}(\mathbb{R})\}$ in (1.1), we obtain the total variation
metric, which we denote $d_{TV}$. We will use the total variation metric for approximation
by discrete distributions.
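
As a small numerical illustration of these definitions, the sketch below evaluates the Kolmogorov and total variation metrics between a binomial distribution and the Poisson distribution with the same mean. The choice of this pair of distributions and all function names are assumptions made for the example.

```python
import math

def binomial_pmf(n, p):
    """Probabilities P(Bin(n, p) = k) for k = 0, ..., n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def poisson_pmf(lam, size):
    """Probabilities P(Poisson(lam) = k) for k = 0, ..., size - 1."""
    pmf = [math.exp(-lam)]
    for k in range(1, size):
        pmf.append(pmf[-1] * lam / k)
    return pmf

def kolmogorov_and_tv(n, p):
    """d_K and d_TV between Binomial(n, p) and Poisson(n p), following (1.1)."""
    b = binomial_pmf(n, p)
    q = poisson_pmf(n * p, n + 1)
    # Total variation: the supremum over sets A is attained at A = {k : b_k > q_k}.
    # The binomial has no mass above n, so no point beyond n belongs to that set.
    d_tv = sum(max(bk - qk, 0.0) for bk, qk in zip(b, q))
    # Kolmogorov: largest gap between the two distribution functions. Beyond n the
    # gap only shrinks, so scanning k = 0, ..., n is enough.
    d_k = cdf_b = cdf_q = 0.0
    for bk, qk in zip(b, q):
        cdf_b += bk
        cdf_q += qk
        d_k = max(d_k, abs(cdf_b - cdf_q))
    return d_k, d_tv

if __name__ == "__main__":
    print(kolmogorov_and_tv(20, 0.1))
```

The half-lines $(-\infty, x]$ are particular Borel sets, so the first number returned can never exceed the second.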

Proposition 1.2. Retaining the notation for the metrics above, we have the following.

1. For random variables $W$ and $Z$, $d_K(W, Z) \le d_{TV}(W, Z)$.

2. If the random variable $Z$ has Lebesgue density bounded by $C$, then for any random
variable $W$,

$$d_K(W, Z) \le \sqrt{2 C\, d_W(W, Z)}.$$

3. For $W$ and $Z$ random variables taking values in a discrete space $\Omega$,

$$d_{TV}(W, Z) = \frac{1}{2} \sum_{\omega \in \Omega} |\mathbb{P}(W = \omega) - \mathbb{P}(Z = \omega)|.$$
Proof. The first item follows from the fact that the supremum on the right side of the
inequality is over a larger set, and the third item is left as an exercise. For the second item,
consider the functions $h_x(w) = \mathbb{I}[w \le x]$, and the 'smoothed' $h_{x,\varepsilon}(w)$ defined to be one for
$w \le x$, zero for $w > x + \varepsilon$, and linear between. Then we have

$$\begin{aligned}
\mathbb{E}h_x(W) - \mathbb{E}h_x(Z) &= \mathbb{E}h_x(W) - \mathbb{E}h_{x,\varepsilon}(Z) + \mathbb{E}h_{x,\varepsilon}(Z) - \mathbb{E}h_x(Z) \\
&\le \mathbb{E}h_{x,\varepsilon}(W) - \mathbb{E}h_{x,\varepsilon}(Z) + C\varepsilon/2 \\
&\le d_W(W, Z)/\varepsilon + C\varepsilon/2.
\end{aligned}$$

Here the first inequality uses $h_x \le h_{x,\varepsilon}$ and the density bound for $Z$ on $[x, x + \varepsilon]$, and the
second uses that $\varepsilon h_{x,\varepsilon}$ has Lipschitz constant one. Taking $\varepsilon = \sqrt{2 d_W(W, Z)/C}$ shows half
of the desired inequality and a similar argument yields the other half. □

Due to its importance in our framework, we reiterate the implication of Item 2 of the
proposition that a bound on the Wasserstein metric between a given distribution and the
normal or exponential distribution immediately yields a bound on the Kolmogorov metric.

2 Normal Approximation

The main idea behind Stein's method of distributional approximation is to replace the
characteristic function typically used to show distributional convergence with a characterizing
operator.

Lemma 2.1 (Stein's Lemma). Define the functional operator $\mathcal{A}$ by

$$\mathcal{A}f(x) = f'(x) - x f(x).$$

1. If the random variable $Z$ has the standard normal distribution, then $\mathbb{E}\mathcal{A}f(Z) = 0$ for
all absolutely continuous $f$ with $\mathbb{E}|f'(Z)| < \infty$.

2. If for some random variable $W$, $\mathbb{E}\mathcal{A}f(W) = 0$ for all absolutely continuous functions
$f$ with $\mathbb{E}|f'(Z)| < \infty$, then $W$ has the standard normal distribution.

The operator $\mathcal{A}$ is referred to as a characterizing operator of the standard normal distribution.

Before proving Lemma 2.1, we record the following lemma and then observe a
consequence.

Lemma 2.2. If $\Phi(x)$ is the c.d.f. of the standard normal distribution, then the unique
bounded solution $f_x$ of the differential equation

$$f_x'(w) - w f_x(w) = \mathbb{I}[w \le x] - \Phi(x) \tag{2.1}$$

is given by

$$\begin{aligned}
f_x(w) &= e^{w^2/2} \int_w^{\infty} e^{-t^2/2} \left( \Phi(x) - \mathbb{I}[t \le x] \right) dt \\
&= -e^{w^2/2} \int_{-\infty}^{w} e^{-t^2/2} \left( \Phi(x) - \mathbb{I}[t \le x] \right) dt.
\end{aligned}$$

Lemmas 2.1 and 2.2 are at the heart of Stein's method; observe the following corollary. 

Corollary 2.3. If $f_x$ is as defined in Lemma 2.2, then for any random variable $W$,

$$|\mathbb{P}(W \le x) - \Phi(x)| = |\mathbb{E}[f_x'(W) - W f_x(W)]|. \tag{2.2}$$

Although Corollary 2.3 follows directly from Lemma 2.2, it is important to note that
Lemma 2.1 suggests that (2.2) may be a fruitful equality. That is, the left hand side of (2.2)
is zero for all $x \in \mathbb{R}$ if and only if $W$ has the standard normal distribution. Lemma 2.1
indicates that the right hand side of (2.2) also has this property.
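
The identity (2.2) can be checked numerically. The sketch below evaluates the integral formula of Lemma 2.2 in closed form through $\Phi$, approximates $f_x'$ by a central difference, and compares the two sides of (2.2) for $W$ a normalized sum of $\pm 1$ variables. The choice of $W$, the step size `h`, and all function names are assumptions made for the example; the point $x$ must not be an atom of $W$, since $f_x$ has a kink at $w = x$.

```python
import math

SQRT2 = math.sqrt(2.0)

def Phi(x):
    """Standard normal c.d.f.; erfc keeps both tails accurate."""
    return 0.5 * math.erfc(-x / SQRT2)

def stein_solution(w, x):
    """f_x(w) from Lemma 2.2, with the integrals evaluated in terms of Phi."""
    scale = math.sqrt(2.0 * math.pi) * math.exp(w * w / 2.0)
    if w <= x:
        return scale * Phi(w) * Phi(-x)   # Phi(-x) = 1 - Phi(x)
    return scale * Phi(x) * Phi(-w)

def stein_side(n, x, h=1e-5):
    """E[f_x'(W) - W f_x(W)] for W = (X_1 + ... + X_n)/sqrt(n) with
    P(X_i = 1) = P(X_i = -1) = 1/2; f_x' is a central difference."""
    total = 0.0
    for k in range(n + 1):
        w = (2 * k - n) / math.sqrt(n)
        prob = math.comb(n, k) / 2 ** n
        deriv = (stein_solution(w + h, x) - stein_solution(w - h, x)) / (2 * h)
        total += prob * (deriv - w * stein_solution(w, x))
    return total

def cdf_side(n, x):
    """P(W <= x) - Phi(x) for the same W."""
    below = sum(math.comb(n, k) for k in range(n + 1)
                if (2 * k - n) / math.sqrt(n) <= x)
    return below / 2 ** n - Phi(x)

if __name__ == "__main__":
    print(stein_side(10, 0.3), cdf_side(10, 0.3))
```

The two printed numbers agree up to the discretization error of the difference quotient, which is the content of (2.2) before absolute values are taken.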

Proof of Lemma 2.2. The method of integrating factors shows that

$$\frac{d}{dw} \left( e^{-w^2/2} f_x(w) \right) = e^{-w^2/2} \left( \mathbb{I}[w \le x] - \Phi(x) \right),$$

which after integrating and considering the homogeneous solution implies that

$$f_x(w) = e^{w^2/2} \int_w^{\infty} e^{-t^2/2} \left( \Phi(x) - \mathbb{I}[t \le x] \right) dt + C e^{w^2/2} \tag{2.3}$$

is the general solution of (2.1) for any constant $C$. To show that (2.3) is bounded for $C = 0$
(and then clearly unbounded for other values of $C$) we use

$$1 - \Phi(w) \le \min\left\{ \frac{1}{2}, \frac{1}{w\sqrt{2\pi}} \right\} e^{-w^2/2}, \quad w > 0,$$

which follows by considering derivatives. From this point we use the representation

$$f_x(w) = \begin{cases}
\sqrt{2\pi}\, e^{w^2/2}\, \Phi(w) (1 - \Phi(x)), & w \le x, \\
\sqrt{2\pi}\, e^{w^2/2}\, \Phi(x) (1 - \Phi(w)), & w > x,
\end{cases}$$

to obtain that $\|f_x\| \le \sqrt{\pi/2}$. □



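Proof of Lemma 2.1. We first prove Item 1 of the lemma. Let $Z$ be a standard normal
random variable and let $f$ be absolutely continuous such that $\mathbb{E}|f'(Z)| < \infty$. Then we
have the following formal calculation (justified by Fubini's Theorem) which is essentially
integration by parts.

$$\begin{aligned}
\mathbb{E}f'(Z) &= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-t^2/2} f'(t)\, dt \\
&= \frac{1}{\sqrt{2\pi}} \int_0^{\infty} f'(t) \int_t^{\infty} w e^{-w^2/2}\, dw\, dt + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^0 f'(t) \int_{-\infty}^t (-w) e^{-w^2/2}\, dw\, dt \\
&= \frac{1}{\sqrt{2\pi}} \int_0^{\infty} w e^{-w^2/2} \left[ \int_0^w f'(t)\, dt \right] dw + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^0 (-w) e^{-w^2/2} \left[ \int_w^0 f'(t)\, dt \right] dw \\
&= \mathbb{E}[Z f(Z)].
\end{aligned}$$

For the second item of the lemma, assume that $W$ is a random variable such that
$\mathbb{E}[f'(W) - W f(W)] = 0$ for all bounded, continuous, and piecewise continuously
differentiable functions $f$ with $\mathbb{E}|f'(Z)| < \infty$. The function $f_x$ satisfying (2.1) is such a function, so
that for all $x \in \mathbb{R}$,

$$0 = \mathbb{E}[f_x'(W) - W f_x(W)] = \mathbb{P}(W \le x) - \Phi(x),$$

which implies that $W$ has a standard normal distribution. □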



Our strategy for bounding the maximum distance between the distribution function of
a random variable $W$ and that of the standard normal is now fairly obvious: we want to
bound $\mathbb{E}[f_x'(W) - W f_x(W)]$ for $f_x$ solving (2.1). This setup can work, but it turns out that
it is easier to work in the Wasserstein metric. Since the critical property of the Kolmogorov
metric that we use in the discussion above is the representation (1.1), which the Wasserstein
metric shares, extending in this direction comes without great effort.



2.1 The general setup 

For two random variables $X$ and $Y$ and some family of functions $\mathcal{H}$, recall the metric

$$d_{\mathcal{H}}(X, Y) = \sup_{h \in \mathcal{H}} |\mathbb{E}h(X) - \mathbb{E}h(Y)|, \tag{2.4}$$

and note that such a metric only depends on the laws of $X$ and $Y$. For $h \in \mathcal{H}$, let $f_h$ solve

$$f_h'(w) - w f_h(w) = h(w) - \Phi(h),$$

where $\Phi(h)$ is the expectation of $h$ with respect to a standard normal distribution. We have
the following result which easily follows from the discussion above.

Proposition 2.4. If $W$ is a random variable and $Z$ has the standard normal distribution,
then

$$d_{\mathcal{H}}(W, Z) = \sup_{h \in \mathcal{H}} |\mathbb{E}[f_h'(W) - W f_h(W)]|. \tag{2.5}$$
The main idea at this point is to bound the right side of (2.5) by using the structure of
$W$ and properties of the solutions $f_h$. The latter issue is handled by the following lemma.

Lemma 2.5. Let $f_h$ be the solution of the differential equation

$$f_h'(w) - w f_h(w) = h(w) - \Phi(h) \tag{2.6}$$

which is given by

$$\begin{aligned}
f_h(w) &= e^{w^2/2} \int_w^{\infty} e^{-t^2/2} \left( \Phi(h) - h(t) \right) dt \\
&= -e^{w^2/2} \int_{-\infty}^{w} e^{-t^2/2} \left( \Phi(h) - h(t) \right) dt.
\end{aligned}$$

1. If $h$ is bounded, then

$$\|f_h\| \le \sqrt{\pi/2}\, \|h(\cdot) - \Phi(h)\| \quad \text{and} \quad \|f_h'\| \le 2 \|h(\cdot) - \Phi(h)\|.$$

2. If $h$ is absolutely continuous, then

$$\|f_h\| \le 2\|h'\|, \quad \|f_h'\| \le \sqrt{2/\pi}\, \|h'\|, \quad \text{and} \quad \|f_h''\| \le 2\|h'\|.$$

The proof of Lemma 2.5 is similar to but more technical than that of Lemma 2.2. We
refer to [24] (Lemma 2.4) for the proof.



3 Bounding the error 

We will focus mainly on the Wasserstein metric when approximating by continuous
distributions. This is not a terrible concession as firstly the Wasserstein metric is a commonly
used metric, and also by Proposition 1.2, for $Z$ a standard normal random variable and $W$
any random variable we have

$$d_K(W, Z) \le (2/\pi)^{1/4} \sqrt{d_W(W, Z)},$$

where $d_K$ is the maximum difference between distribution functions (the Kolmogorov metric);
$d_K$ is an intuitive and standard metric to work with.

The reason for using the Wasserstein metric is that it has the form (2.4) for $\mathcal{H}$ the set
of functions with Lipschitz constant at most one. In particular, if $h$ is a test function for
the Wasserstein metric, then $\|h'\| \le 1$ so that we know the solution $f_h$ of equation (2.6)
is bounded with two bounded derivatives by Item 2 of Lemma 2.5. Contrast this to
the set of test functions for the Kolmogorov metric where the solution of equation (2.6)
is bounded with one bounded derivative (by Item 1 of Lemma 2.5) but is not twice
differentiable.

To summarize our progress to this point, we state the following result which is a corollary 
of Proposition 2.4 and Lemma 2.5. The theorem is at the kernel of Stein's method. 

Theorem 3.1. If $W$ is a random variable and $Z$ has the standard normal distribution, and
we define the family of functions $\mathcal{F} = \{f : \|f\|, \|f''\| \le 2, \|f'\| \le \sqrt{2/\pi}\}$, then

$$d_W(W, Z) \le \sup_{f \in \mathcal{F}} |\mathbb{E}[f'(W) - W f(W)]|. \tag{3.1}$$

In the remainder of this section, we discuss methods to bound $|\mathbb{E}[f'(W) - W f(W)]|$ using
the structure of $W$. We will identify general structures that are amenable to this task (for
other structures in greater generality see [46]), but first we illustrate the type of result we
are looking for in the following standard example.



3.1 Sum of independent random variables 

We will show the following result which follows from Theorem 3.1 and Lemma 3.4 below. 

Theorem 3.2. Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}|X_i|^4 < \infty$, $\mathbb{E}X_i = 0$,
and $\mathbb{E}X_i^2 = 1$. If $W = (\sum_{i=1}^n X_i)/\sqrt{n}$ and $Z$ has the standard normal distribution, then

$$d_W(W, Z) \le \frac{1}{n^{3/2}} \sum_{i=1}^n \mathbb{E}|X_i|^3 + \frac{\sqrt{2}}{\sqrt{\pi}\, n} \sqrt{\sum_{i=1}^n \mathbb{E}[X_i^4]}.$$

Before the proof we remark that if the $X_i$ of the theorem also have common distribution,
then the rate of convergence is order $n^{-1/2}$, which is the best possible. It is also useful to
compare this result to Theorem 1.1 which is in a different metric (neither result is recoverable
from the other in full strength) and only assumes third moments. A small modification in
the argument below yields a similar theorem assuming only third moments, but the structure
of proof for the theorem as stated is one that we shall copy in the sequel.
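
The bound of Theorem 3.2 can be compared with the true Wasserstein distance in a case where the latter is computable. The sketch below uses the representation of the Wasserstein distance on the line as the integral of the absolute difference of distribution functions, a standard fact that is not stated in this article, and evaluates it by a midpoint rule for a normalized sum of $\pm 1$ variables. The summand distribution, the integration parameters, and all function names are assumptions made for the example.

```python
import math

def Phi(x):
    """Standard normal c.d.f."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def wasserstein_to_normal_rademacher(n, step=1e-3, cutoff=12.0):
    """Approximate d_W(W, Z) as the integral of |P(W <= x) - Phi(x)| over
    [-cutoff, cutoff], for W = (X_1 + ... + X_n)/sqrt(n) with
    P(X_i = 1) = P(X_i = -1) = 1/2. Assumes sqrt(n) is well below cutoff."""
    cdf, running = [], 0
    for k in range(n + 1):
        running += math.comb(n, k)
        cdf.append(running / 2 ** n)
    root_n = math.sqrt(n)
    total = 0.0
    for j in range(int(round(2 * cutoff / step))):
        x = -cutoff + (j + 0.5) * step
        # The atoms are (2k - n)/sqrt(n); the largest one at or below x has index k.
        k = math.floor((x * root_n + n) / 2)
        f_n = 0.0 if k < 0 else cdf[min(k, n)]
        total += abs(f_n - Phi(x)) * step
    return total

def theorem_3_2_bound(third_abs_moments, fourth_moments):
    """Right side of Theorem 3.2 from the lists of E|X_i|^3 and E[X_i^4]."""
    n = len(third_abs_moments)
    return (sum(third_abs_moments) / n ** 1.5
            + math.sqrt(2.0 / math.pi) * math.sqrt(sum(fourth_moments)) / n)

if __name__ == "__main__":
    for n in (4, 25, 100):
        print(n, wasserstein_to_normal_rademacher(n), theorem_3_2_bound([1.0] * n, [1.0] * n))
```

For these summands all moments equal one, so the bound reduces to $(1 + \sqrt{2/\pi})/\sqrt{n}$.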

In order to prepare for arguments to come, we will break the proof into a series of lemmas. 
Since our strategy is to apply Theorem 3.1 by estimating the right side of (3.1) for bounded 
/ with bounded first and second derivative, the first lemma shows an expansion of the right 
side of (3.1) using the structure of W as defined in Theorem 3.2. 



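Lemma 3.3. In the notation of Theorem 3.2, if $W_i = (\sum_{j \ne i} X_j)/\sqrt{n}$, then

$$\mathbb{E}[W f(W)] = \mathbb{E}\left[ \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \left( f(W) - f(W_i) - (W - W_i) f'(W) \right) \right] \tag{3.2}$$

$$\qquad + \mathbb{E}\left[ \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i (W - W_i) f'(W) \right]. \tag{3.3}$$

Proof. After noting that the negative of (3.3) is contained in (3.2) and removing these terms
from consideration, the lemma is equivalent to

$$\mathbb{E}[W f(W)] = \mathbb{E}\left[ \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( X_i f(W) - X_i f(W_i) \right) \right]. \tag{3.4}$$

Equation (3.4) follows easily from the fact that $W_i$ is independent of $X_i$ so that $\mathbb{E}[X_i f(W_i)] = 0$. □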

The proof of the theorem will follow after we show that (3.2) is small and that (3.3)
compares favorably to $\mathbb{E}[f'(W)]$; we will see similar strategies frequently in the sequel.

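Lemma 3.4. If $f$ is a bounded function with bounded first and second derivative, then in
the notation of Theorem 3.2,

$$|\mathbb{E}[f'(W) - W f(W)]| \le \frac{\|f''\|}{2 n^{3/2}} \sum_{i=1}^n \mathbb{E}|X_i|^3 + \frac{\|f'\|}{n} \sqrt{\sum_{i=1}^n \mathbb{E}[X_i^4]}. \tag{3.5}$$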



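Proof. Using the notation and results of Lemma 3.3, we obtain

$$|\mathbb{E}[f'(W) - W f(W)]| \le \left| \mathbb{E}\left[ \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \left( f(W) - f(W_i) - (W - W_i) f'(W) \right) \right] \right| \tag{3.6}$$

$$\qquad + \left| \mathbb{E}\left[ f'(W) \left( 1 - \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i (W - W_i) \right) \right] \right|. \tag{3.7}$$

By Taylor expansion, the triangle inequality, and after pushing the absolute value inside the
expectation, we obtain that (3.6) is bounded above by

$$\frac{\|f''\|}{2\sqrt{n}} \sum_{i=1}^n \mathbb{E}\left| X_i (W - W_i)^2 \right|.$$

Since $W - W_i = X_i/\sqrt{n}$, we obtain the first term in the bound (3.5). We find that (3.7)
is bounded above by

$$\frac{\|f'\|}{n}\, \mathbb{E}\left| \sum_{i=1}^n (1 - X_i^2) \right| \le \frac{\|f'\|}{n} \sqrt{\mathrm{Var}\left( \sum_{i=1}^n X_i^2 \right)},$$

where we have used the Cauchy-Schwarz inequality. By independence and the fact that
$\mathrm{Var}(X_i^2) \le \mathbb{E}[X_i^4]$, we obtain the second term in the bound (3.5). □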
We can see from the work above that the strategy to bound $\mathbb{E}[f'(W) - W f(W)]$ is to
use the structure of $W$ to rewrite $\mathbb{E}[W f(W)]$ in a way that compares favorably to $\mathbb{E}[f'(W)]$.
Rather than attempt this program anew in each application that arises, we will develop
out-the-door theorems that provide error terms for various canonical structures which arise
in many applications.



3.2 Dependency Neighborhoods 

We now generalize Theorem 3.2 to sums of random variables with local dependence. 

Definition 3.1. We say that a collection of random variables $(X_1, \ldots, X_n)$ has dependency
neighborhoods $N_i \subseteq \{1, \ldots, n\}$, $i = 1, \ldots, n$, if $X_i$ is independent of $\{X_j\}_{j \notin N_i}$.

If we think of constructing a graph with vertices $\{1, \ldots, n\}$ where if there is no edge
between $i$ and $j$ then $X_i$ and $X_j$ are independent, then we can define $N_i \setminus \{i\}$ as the neighbors
of vertex $i$ in the graph. For this reason, dependency neighborhoods are frequently referred
to as dependency graphs. Using the Stein's method framework and a modification of the
argument for sums of independent random variables we can prove the following theorem,
some version of which can be read from the main result of [9].

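Theorem 3.5. Let $X_1, \ldots, X_n$ be random variables with $\mathbb{E}[X_i^4] < \infty$, $\mathbb{E}[X_i] = 0$, $\sigma^2 =
\mathrm{Var}(\sum_i X_i)$, and define $W = \sum_i X_i/\sigma$. Let the collection $(X_1, \ldots, X_n)$ have dependency
neighborhoods $N_i$, $i = 1, \ldots, n$, with $D := \max_{1 \le i \le n} |N_i|$. Then for $Z$ a standard normal
random variable,

$$d_W(W, Z) \le \frac{D^2}{\sigma^3} \sum_{i=1}^n \mathbb{E}|X_i|^3 + \frac{\sqrt{26}\, D^{3/2}}{\sqrt{\pi}\, \sigma^2} \sqrt{\sum_{i=1}^n \mathbb{E}[X_i^4]}. \tag{3.8}$$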



Note that this theorem quantifies the heuristic that a sum of many locally dependent 
random variables will be approximately normal. When viewed as an asymptotic result, it's 
clear that under some conditions a CLT will hold even with D growing with n. It is also 
possible to prove similar theorems using further information about the dependence structure 
of the variables; see [25]. 

The proof of the theorem will be analogous to the case of sums of independent random 
variables (a special case of this theorem), but the analysis will be a little more complicated 
due to the dependence. 

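Proof. From Theorem 3.1, to upper bound $d_W(W, Z)$ it is enough to bound $|\mathbb{E}[f'(W) - W f(W)]|$,
where $\|f\|, \|f''\| \le 2$ and $\|f'\| \le \sqrt{2/\pi}$. Define $W_i = \sum_{j \notin N_i} X_j/\sigma$ and note that $X_i$
is independent of $W_i$. As in the proof of Theorem 3.2, we can now write

$$|\mathbb{E}[f'(W) - W f(W)]| \le \left| \mathbb{E}\left[ \frac{1}{\sigma} \sum_{i=1}^n X_i \left( f(W) - f(W_i) - (W - W_i) f'(W) \right) \right] \right| \tag{3.9}$$

$$\qquad + \left| \mathbb{E}\left[ f'(W) \left( 1 - \frac{1}{\sigma} \sum_{i=1}^n X_i (W - W_i) \right) \right] \right|. \tag{3.10}$$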



We now proceed by showing that (3.9) is bounded above by the first term in (3.8) and (3.10) 
is bounded above by the second. 

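By Taylor expansion, the triangle inequality, and after pushing the absolute value inside
the expectation, we obtain that (3.9) is bounded above by

$$\frac{\|f''\|}{2\sigma} \sum_{i=1}^n \mathbb{E}\left| X_i (W - W_i)^2 \right| \le \frac{1}{\sigma^3} \sum_{i=1}^n \mathbb{E}\left| X_i \Big( \sum_{j \in N_i} X_j \Big)^2 \right| \tag{3.11}$$

$$\qquad \le \frac{1}{\sigma^3} \sum_{i=1}^n \sum_{j,k \in N_i} \mathbb{E}|X_i X_j X_k|.$$

The arithmetic-geometric mean inequality implies that

$$\mathbb{E}|X_i X_j X_k| \le \frac{1}{3} \left( \mathbb{E}|X_i|^3 + \mathbb{E}|X_j|^3 + \mathbb{E}|X_k|^3 \right),$$

so that (3.9) is bounded above by the first term in the bound (3.8), where we use for example
that

$$\sum_{i=1}^n \sum_{j,k \in N_i} \mathbb{E}|X_j|^3 \le D^2 \sum_{j=1}^n \mathbb{E}|X_j|^3.$$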

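Similar consideration implies that (3.10) is bounded above by

$$\frac{\|f'\|}{\sigma^2}\, \mathbb{E}\left| \sigma^2 - \sum_{i=1}^n X_i \sum_{j \in N_i} X_j \right| \le \frac{\sqrt{2}}{\sqrt{\pi}\, \sigma^2} \sqrt{\mathrm{Var}\left( \sum_{i=1}^n \sum_{j \in N_i} X_i X_j \right)}, \tag{3.12}$$

where the inequality follows from the Cauchy-Schwarz inequality coupled with the
representation

$$\sigma^2 = \mathbb{E}\left[ \sum_{i=1}^n X_i \sum_{j \in N_i} X_j \right].$$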

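The remainder of the proof consists of analysis on (3.12), but note that in practice it may
be possible to bound this term directly. In order to bound the variance under the square
root in (3.12), we first compute

$$\mathbb{E}\left[ \Big( \sum_{i=1}^n \sum_{j \in N_i} X_i X_j \Big)^2 \right] = \sum_{i \ne j} \sum_{k \in N_i} \sum_{l \in N_j} \mathbb{E}[X_i X_j X_k X_l] \tag{3.13}$$

$$\qquad + \sum_{i=1}^n \sum_{j \in N_i} \mathbb{E}[X_i^2 X_j^2] + \sum_{i=1}^n \sum_{j \in N_i} \sum_{k \in N_i \setminus \{j\}} \mathbb{E}[X_i^2 X_j X_k]. \tag{3.14}$$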

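Using the arithmetic-geometric mean inequality, the first term of (3.14) is bounded above
by

$$\frac{1}{2} \sum_{i=1}^n \sum_{j \in N_i} \left( \mathbb{E}[X_i^4] + \mathbb{E}[X_j^4] \right) \le D \sum_{i=1}^n \mathbb{E}[X_i^4],$$

and the second by

$$\frac{1}{4} \sum_{i=1}^n \sum_{j \in N_i} \sum_{k \in N_i \setminus \{j\}} \left( 2\mathbb{E}[X_i^4] + \mathbb{E}[X_j^4] + \mathbb{E}[X_k^4] \right) \le D(D - 1) \sum_{i=1}^n \mathbb{E}[X_i^4].$$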
We decompose the term (3.13) into two components;

$$\sum_{i \ne j} \sum_{k \in N_i} \sum_{l \in N_j} \mathbb{E}[X_i X_j X_k X_l] = \sum_{\{i,k\},\{j,l\}} \mathbb{E}[X_i X_k]\, \mathbb{E}[X_j X_l] + \sum_{\{i,k,j,l\}} \mathbb{E}[X_i X_j X_k X_l], \tag{3.15}$$

where the first sum denotes the indices in which $\{X_i, X_k\}$ are independent of $\{X_j, X_l\}$, and
the second term consists of those remaining. Note that by the arithmetic-geometric mean
inequality, the second term of (3.15) is bounded above by

$$6 D^3 \sum_{i=1}^n \mathbb{E}[X_i^4],$$

since the number of "connected components" with at most four vertices of the dependency
graph induced by the neighborhoods is no more than $D \times 2D \times 3D$. The first term of (3.15)
equals

$$\sigma^4 - \sum_{\{i,k,j,l\}} \mathbb{E}[X_i X_k]\, \mathbb{E}[X_j X_l],$$

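and a couple of applications of the arithmetic-geometric mean inequality yield

$$\begin{aligned}
-\mathbb{E}[X_i X_k]\, \mathbb{E}[X_j X_l] &\le \frac{1}{2} \left( \mathbb{E}[X_i X_k]^2 + \mathbb{E}[X_j X_l]^2 \right) \\
&\le \frac{1}{2} \left( \mathbb{E}[X_i^2 X_k^2] + \mathbb{E}[X_j^2 X_l^2] \right) \\
&\le \frac{1}{4} \left( \mathbb{E}[X_i^4] + \mathbb{E}[X_k^4] + \mathbb{E}[X_j^4] + \mathbb{E}[X_l^4] \right).
\end{aligned}$$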
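Putting everything together, we obtain that

$$\begin{aligned}
\mathrm{Var}\left( \sum_{i=1}^n \sum_{j \in N_i} X_i X_j \right) &= \mathbb{E}\left[ \Big( \sum_{i=1}^n \sum_{j \in N_i} X_i X_j \Big)^2 \right] - \sigma^4 \\
&\le (12 D^3 + D^2) \sum_{i=1}^n \mathbb{E}[X_i^4] \le 13 D^3 \sum_{i=1}^n \mathbb{E}[X_i^4],
\end{aligned}$$

which yields the theorem. □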



Note that much of the proof of Theorem 3.5 consists of bounding the error in a simple
form. However, an upper bound for $d_W(W, Z)$ is obtained by adding the intermediate terms
(3.11) and (3.12) which in many applications may be directly bounded (and produce better
bounds).

Theorem 3.5 is an intuitively pleasing result that has many applications; a notable ex- 
ample is [8] where CLTs for statistics of various random geometric graphs are shown. We 
apply it in the following setting. 

3.2.1 Application: Triangles in Erdős-Rényi random graphs

Let $G = G(n,p)$ be an Erdős-Rényi random graph on $n$ vertices with edge probability $p$ and
let $T$ be the number of triangles in $G$. We can write $T = \sum_{i=1}^N Y_i$, where $N = \binom{n}{3}$, and
$Y_i$ is the indicator that a triangle is formed at the "$i$th" set of three vertices, in some
arbitrary but fixed order. For $i \ne j$, $Y_i$ is independent of $Y_j$ if and only if the collection of
edges between the vertices indexed by $i$ is disjoint from those indexed by $j$. Thus we let the
set $N_i \setminus \{i\}$ contain indices which share exactly two vertices with those indexed by $i$ so that
$|N_i| = 3(n - 3) + 1$ and we can apply Theorem 3.5 with $X_i = Y_i - p^3$ and $D = 3n - 8$. Since

$$\mathbb{E}|X_i|^k = p^3 (1 - p^3) \left[ (1 - p^3)^{k-1} + p^{3(k-1)} \right], \quad k = 1, 2, \ldots,$$

we now only have to compute $\mathrm{Var}(T)$ to apply the theorem. A simple calculation using a
decomposition of $T$ into indicators shows that

$$\sigma^2 := \mathrm{Var}(T) = \binom{n}{3} p^3 \left[ 1 - p^3 + 3(n - 3) p^2 (1 - p) \right],$$

and Theorem 3.5 implies that for $W = (T - \mathbb{E}[T])/\sigma$ and $Z$ a standard normal random
variable

$$d_W(W, Z) \le \frac{(3n - 8)^2}{\sigma^3} \binom{n}{3} p^3 (1 - p^3) \left[ (1 - p^3)^2 + p^6 \right] + \frac{\sqrt{26}\, (3n - 8)^{3/2}}{\sqrt{\pi}\, \sigma^2} \sqrt{\binom{n}{3} p^3 (1 - p^3) \left[ (1 - p^3)^3 + p^9 \right]}.$$
This bound holds for all $n \ge 3$ and $0 \le p \le 1$, but some asymptotic analysis shows that
if, for example, $p \sim n^{-\alpha}$ for some $0 \le \alpha < 1$ (so that $\mathrm{Var}(T) \to \infty$), then the number of
triangles satisfies a CLT for $0 \le \alpha < 2/9$, which is only a subset of the regime where normal
convergence holds [49]. It is possible that starting from (3.11) and (3.12) would yield better
rates in a wider regime, and considering finer structure yields better results [12].
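
The quantities in this application are easy to evaluate numerically. The sketch below computes the variance of the triangle count from the indicator decomposition, checks it against a brute-force enumeration of all graphs on a handful of vertices, and evaluates a dependency-neighborhood bound of the form $D^2 \sigma^{-3} \sum_i \mathbb{E}|X_i|^3 + c\, D^{3/2} \pi^{-1/2} \sigma^{-2} (\sum_i \mathbb{E}[X_i^4])^{1/2}$ with $D = 3n - 8$. The numerical constant $c = \sqrt{26}$ is an assumption read off from the estimates in the proof of Theorem 3.5, and all function names are hypothetical.

```python
import itertools
import math

def triangle_variance(n, p):
    """Var(T) for the triangle count T of G(n, p), from the indicator decomposition."""
    return math.comb(n, 3) * p**3 * (1 - p**3 + 3 * (n - 3) * p**2 * (1 - p))

def centered_abs_moment(p, k):
    """E|Y - p^3|^k for an indicator Y with P(Y = 1) = p^3."""
    q = p**3
    return q * (1 - q) * ((1 - q) ** (k - 1) + q ** (k - 1))

def triangle_bound(n, p, constant=math.sqrt(26.0)):
    """Dependency-neighborhood bound with D = 3n - 8; `constant` is an assumption."""
    count = math.comb(n, 3)
    d = 3 * n - 8
    var = triangle_variance(n, p)
    first = d**2 * count * centered_abs_moment(p, 3) / var**1.5
    second = (constant * d**1.5 / (math.sqrt(math.pi) * var)
              * math.sqrt(count * centered_abs_moment(p, 4)))
    return first + second

def triangle_variance_bruteforce(n, p):
    """Exact Var(T) by enumerating every graph on n vertices (tiny n only)."""
    edges = list(itertools.combinations(range(n), 2))
    triples = list(itertools.combinations(range(n), 3))
    mean = second = 0.0
    for present in itertools.product((0, 1), repeat=len(edges)):
        on = {e for e, bit in zip(edges, present) if bit}
        prob = math.prod(p if bit else 1 - p for bit in present)
        t = sum(all(pair in on for pair in itertools.combinations(tri, 2))
                for tri in triples)
        mean += prob * t
        second += prob * t * t
    return second - mean**2

if __name__ == "__main__":
    print(triangle_variance(5, 0.3), triangle_variance_bruteforce(5, 0.3))
    for n in (50, 100, 200, 2000):
        print(n, triangle_bound(n, 0.5))
```

For fixed $p$ the value returned by `triangle_bound` decays roughly like $1/n$, but it only drops below one for fairly large $n$, which reflects how crude the general-purpose bound is in this setting.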

3.3 Exchangeable pairs 

We begin with a definition. 

Definition 3.2. The ordered pair $(W, W')$ of random variables is called an exchangeable
pair if $(W, W') \stackrel{d}{=} (W', W)$. If for some $0 < a \le 1$, the exchangeable pair $(W, W')$ satisfies
the relation

$$\mathbb{E}[W' \mid W] = (1 - a) W,$$

then we call $(W, W')$ an $a$-Stein pair.

The next proposition contains some easy facts related to Stein pairs. 
Proposition 3.6. Let $(W, W')$ be an exchangeable pair.

1. If $F : \mathbb{R}^2 \to \mathbb{R}$ is an anti-symmetric function; that is $F(x, y) = -F(y, x)$, then
$\mathbb{E}[F(W, W')] = 0$.

If $(W, W')$ is an $a$-Stein pair with $\mathrm{Var}(W) = \sigma^2$, then

2. $\mathbb{E}[W] = 0$ and $\mathbb{E}[(W' - W)^2] = 2a\sigma^2$.

Proof. Item 1 follows by the following equalities, the first by exchangeability and the second
by anti-symmetry of $F$.

$$\mathbb{E}[F(W, W')] = \mathbb{E}[F(W', W)] = -\mathbb{E}[F(W, W')].$$

The first assertion of Item 2 follows from the fact that $\mathbb{E}[W] = \mathbb{E}[W'] = (1 - a)\mathbb{E}[W]$, and
the second by calculating

$$\mathbb{E}[(W' - W)^2] = \mathbb{E}[(W')^2] + \mathbb{E}[W^2] - 2\mathbb{E}[W\, \mathbb{E}[W' \mid W]] = 2\sigma^2 - 2(1 - a)\sigma^2 = 2a\sigma^2.$$

□

From this point we illustrate the use of the exchangeable pair in the following theorem. 
Theorem 3.7. If $(W, W')$ is an $a$-Stein pair with $\mathbb{E}[W^2] = 1$ and $Z$ has the standard normal
distribution, then

$$d_W(W, Z) \le \frac{\sqrt{\mathrm{Var}\left( \mathbb{E}[(W' - W)^2 \mid W] \right)}}{\sqrt{2\pi}\, a} + \frac{\mathbb{E}|W' - W|^3}{3a}.$$

Before the proof come a few remarks.

Remark 3.3. The strategy for using Theorem 3.7 to obtain an error in the approximation
of the distribution of a random variable $W$ by the standard normal is to construct $W'$ on the
same space as $W$, such that $(W, W')$ is an $a$-Stein pair. How can we achieve this construction?
Typically $W = W(\omega)$ is a random variable on some space $\Omega$ with probability measure $\mu$. It
is not too difficult to see that if $X_0, X_1, \ldots$ is a Markov chain in stationarity which is reversible
with respect to $\mu$, then setting (with some abusive notation) $W = W(X_0)$ and $W' = W(X_1)$
defines an exchangeable pair. Since there is much effort put into constructing reversible
Markov chains (e.g. Gibbs sampler), this is a useful method to construct exchangeable
pairs. However, the linearity condition is not as easily abstractly constructed and must be
verified.

Remark 3.4. In light of the previous remark, it is useful to note that

$$\mathrm{Var}\left( \mathbb{E}[(W' - W)^2 \mid W] \right) \le \mathrm{Var}\left( \mathbb{E}[(W' - W)^2 \mid \mathcal{F}] \right),$$

for any sigma-field $\mathcal{F}$ which is larger than the sigma-field generated by $W$. With notation
in the previous remark, in many instances it is helpful to condition on $X_0$ rather than $W$
when computing the error bound from Theorem 3.7.

Remark 3.5. A heuristic explanation for the form of the error terms appearing in Theorem
3.7 arises by considering an Ornstein-Uhlenbeck (O-U) diffusion process. Define the diffusion
process $(D(t))_{t \ge 0}$ by the following properties.

1. $\mathbb{E}[D(t + a) - D(t) \mid D(t) = x] = -ax + o(a)$.

2. $\mathbb{E}[(D(t + a) - D(t))^2 \mid D(t) = x] = 2a + o(a)$.

3. For all $\varepsilon > 0$, $\mathbb{P}[|D(t + a) - D(t)| > \varepsilon \mid D(t) = x] = o(a)$.

Here the function $g(a)$ is $o(a)$ if $g(a)/a$ tends to zero as $a$ tends to zero. These three properties
determine the O-U diffusion process, and this process is reversible with the standard normal
distribution as its stationary distribution. What does this have to do with Theorem 3.7?
Roughly, if we think of $W$ as $D(t)$ and $W'$ as $D(t + a)$ for some small $a$, then Item 1
corresponds to the $a$-Stein pair linearity condition. Item 2 implies that the first term of the
error in Theorem 3.7 will be small, and Item 3 relates to the second term in the error.
Proof of Theorem 3.7. The strategy of the proof is to use the exchangeable pair to rewrite
$\mathbb{E}[W f(W)]$ in such a way that compares favorably to $\mathbb{E}[f'(W)]$. To this end, let $f$ be bounded
with bounded first and second derivative and let $F(w) := \int_0^w f(t)\, dt$. Now, exchangeability
and Taylor expansion imply that

$$\begin{aligned}
0 &= \mathbb{E}[F(W') - F(W)] \\
&= \mathbb{E}\left[ (W' - W) f(W) + \frac{1}{2} (W' - W)^2 f'(W) + \frac{1}{6} (W' - W)^3 f''(W^*) \right],
\end{aligned} \tag{3.16}$$

where $W^*$ is a random quantity in the interval with endpoints $W$ and $W'$. Now, the linearity
condition on the Stein pair yields

$$\mathbb{E}[(W' - W) f(W)] = \mathbb{E}\left[ f(W)\, \mathbb{E}[(W' - W) \mid W] \right] = -a\, \mathbb{E}[W f(W)]. \tag{3.17}$$

Combining (3.16) and (3.17) we obtain

$$\mathbb{E}[W f(W)] = \mathbb{E}\left[ \frac{(W' - W)^2 f'(W)}{2a} + \frac{(W' - W)^3 f''(W^*)}{6a} \right].$$

From this point we can easily see

$$|\mathbb{E}[f'(W) - W f(W)]| \le \|f'\|\, \mathbb{E}\left| 1 - \frac{\mathbb{E}[(W' - W)^2 \mid W]}{2a} \right| + \|f''\|\, \frac{\mathbb{E}|W' - W|^3}{6a}, \tag{3.18}$$

and the theorem will follow after noting that we are only considering $f$ with $\|f'\| \le \sqrt{2/\pi}$,
and $\|f''\| \le 2$, and that from Item 2 of Proposition 3.6, we have $\mathbb{E}[\mathbb{E}[(W' - W)^2 \mid W]] = 2a$
so that an application of the Cauchy-Schwarz inequality yields the variance term in the
bound. □

Before moving to a heavier application, we consider the canonical example of a sum of 
independent random variables. 

Example 3.6. Let $X_1, \ldots, X_n$ be independent with $\mathbb{E}[X_i^4] < \infty$, $\mathbb{E}[X_i] = 0$, $\operatorname{Var}(X_i) = 1$, and $W = n^{-1/2}\sum_{i=1}^n X_i$. We construct our exchangeable pair by choosing an index uniformly at random and replacing it by an independent copy. Formally, let $I$ be uniform on $\{1, \ldots, n\}$, let $(X_1', \ldots, X_n')$ be an independent copy of $(X_1, \ldots, X_n)$, and define

\[ W' = W - \frac{X_I}{\sqrt{n}} + \frac{X_I'}{\sqrt{n}}. \]

It is a simple exercise to show that $(W, W')$ is exchangeable, and we now verify that it is also a $1/n$-Stein pair. The calculation below is straightforward; in the penultimate equality we use the independence of $X_i$ and $X_i'$ and the fact that $\mathbb{E}[X_i'] = 0$.



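\begin{align*}
\mathbb{E}[W' - W \mid (X_1, \ldots, X_n)] &= \frac{1}{\sqrt{n}}\,\mathbb{E}[X_I' - X_I \mid (X_1, \ldots, X_n)] \\
&= \frac{1}{\sqrt{n}} \cdot \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i' - X_i \mid (X_1, \ldots, X_n)] \\
&= -\frac{1}{n}\sum_{i=1}^n \frac{X_i}{\sqrt{n}} = -\frac{W}{n}.
\end{align*}

Since the conditional expectation given the larger sigma-field depends only on $W$, we have that $\mathbb{E}[W' - W \mid W] = -W/n$, as desired.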
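We can now apply Theorem 3.7. We first bound

\[ \mathbb{E}|W'-W|^3 = \frac{1}{n^{5/2}}\sum_{i=1}^n \mathbb{E}|X_i - X_i'|^3 \le \frac{8}{n^{5/2}}\sum_{i=1}^n \mathbb{E}|X_i|^3, \]

where we used the arithmetic-geometric mean inequality for the cross terms of the expansion of the cube of the difference (we could also express the error in terms of these lower moments by independence). Next we compute

\[ \mathbb{E}[(W'-W)^2 \mid (X_1, \ldots, X_n)] = \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}[(X_i' - X_i)^2 \mid X_i] = \frac{1}{n^2}\sum_{i=1}^n (1 + X_i^2). \]

Taking the variance we see that

\[ \operatorname{Var}\left(\mathbb{E}[(W'-W)^2 \mid W]\right) \le \frac{1}{n^4}\sum_{i=1}^n \mathbb{E}[X_i^4]. \]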

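Combining the estimates above with Theorem 3.7 (where $a = 1/n$) we have

\[ d_W(W, Z) \le \frac{\sqrt{\sum_{i=1}^n \mathbb{E}[X_i^4]}}{\sqrt{2\pi}\,n} + \frac{8}{3n^{3/2}}\sum_{i=1}^n \mathbb{E}|X_i|^3. \]

Note that if the $X_i$ are i.i.d. then this term is of order $n^{-1/2}$, which is best possible. Finally, we could probably get away with only assuming three moments for the $X_i$ if we use the intermediate term (3.18) in the proof of Theorem 3.7.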
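The linearity condition of Example 3.6 can be checked exactly on a small case. The following is only a sketch under stated assumptions: the summands are taken to be Rademacher signs (an illustrative choice, not part of the example), the helper name `conditional_mean_shift` is hypothetical, and the check is carried out for the unnormalized sum $S = \sqrt{n}\,W$, for which the condition reads $\mathbb{E}[S' - S \mid X] = -S/n$.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_shift(x, support):
    """Return E[S' - S | X = x], where S = sum(x) and S' is obtained by
    replacing a uniformly chosen coordinate of x with an independent draw
    from `support`, a list of (value, probability) pairs with mean zero."""
    n = len(x)
    total = Fraction(0)
    for i in range(n):                  # the index I is uniform on the n coordinates
        for value, prob in support:     # X_I' is an independent copy of X_I
            total += Fraction(1, n) * prob * (value - x[i])
    return total

# Assumed mean-zero, variance-one law for the summands: Rademacher signs.
rademacher = [(Fraction(-1), Fraction(1, 2)), (Fraction(1), Fraction(1, 2))]

n = 4
for x in product([Fraction(-1), Fraction(1)], repeat=n):
    # Linearity condition with a = 1/n, stated for the unnormalized sum.
    assert conditional_mean_shift(x, rademacher) == -Fraction(sum(x), n)
```

Dividing both sides by $\sqrt{n}$ recovers the statement for $W$; exact rational arithmetic is used so that the identity is verified without rounding error.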
3.3.1 Application: Anti-voter model

In this section we consider an application of Theorem 3.7 found in [44]; we closely follow their treatment. Let $G$ be an $r$-regular^ graph with vertex set $V$ and edge set $E$. Define a Markov chain on the space $\{-1, 1\}^V$ of labelings of the vertices of $V$ by $+1$ and $-1$. The chain follows the rule of uniformly choosing a vertex $v \in V$, then uniformly choosing a neighbor of $v$ and changing the label of $v$ to the opposite of the label of that neighbor. The model gets its name from thinking of the vertices as people in a town full of curmudgeons, where a positive (negative) label corresponds to a yes (no) vote for some measure. At each time unit a random person talks to a random neighbor and decides to switch votes to the opposite of that neighbor.
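The update rule just described can be sketched in a few lines. This is a minimal illustration rather than code from [44]: the function name `anti_voter_step`, the adjacency-list representation, and the choice of the complete graph on five vertices (which is 4-regular, not bipartite, and not a cycle) are all assumptions made for the example.

```python
import random

def anti_voter_step(labels, neighbors, rng):
    """One step of the anti-voter chain: pick a uniform vertex v, then a
    uniform neighbor u of v, and set the label of v to minus the label of u."""
    v = rng.randrange(len(labels))
    u = rng.choice(neighbors[v])
    new_labels = list(labels)
    new_labels[v] = -labels[u]
    return new_labels

# Illustrative 4-regular graph: the complete graph on 5 vertices.
n = 5
neighbors = [[j for j in range(n) if j != i] for i in range(n)]
rng = random.Random(0)
state = [rng.choice([-1, 1]) for _ in range(n)]
for _ in range(1000):
    nxt = anti_voter_step(state, neighbors, rng)
    # At most one label flips, so the sum of the labels moves by -2, 0 or 2.
    assert sum(nxt) - sum(state) in (-2, 0, 2)
    state = nxt
```

The assertion inside the loop records the only structural fact about a single step that is used later: the sum of the labels changes by $-2$, $0$, or $2$.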

It is known [1] that if the underlying graph G is not bipartite or a cycle, then the 
anti-voter chain is irreducible and aperiodic and has a unique stationary distribution. This 
distribution can be difficult to describe, but we can use Theorem 3.7 to obtain an error in 
the Wasserstein distance to the standard normal distribution for the sum of the labels of 
the vertices. We now state the theorem and postpone discussion of computing the relevant 
quantities in the error until after the proof. 

Theorem 3.8. Let $G$ be an $r$-regular graph with $n$ vertices which is not bipartite or a cycle. Let $\mathbf{X} = (X_i)_{i=1}^n \in \{-1, 1\}^n$ have the stationary distribution of the anti-voter chain and let $\mathbf{X}' = (X_i')_{i=1}^n$ be one step in the chain. Let $\sigma_n^2 = \operatorname{Var}(\sum_i X_i)$, $W = \sigma_n^{-1}\sum_i X_i$, and $W' = \sigma_n^{-1}\sum_i X_i'$. Then $(W, W')$ is a $2/n$-Stein pair, and if $Z$ has the standard normal distribution, then

\[ d_W(W, Z) \le \frac{4n}{3\sigma_n^3} + \frac{\sqrt{\operatorname{Var}(Q)}}{r\sigma_n^2\sqrt{2\pi}}, \]

where

\[ Q = \sum_{i=1}^n \sum_{j \in N_i} X_i X_j, \]

and $N_i$ denotes the neighbors of $i$.

Part of the first assertion of the theorem is that $(W, W')$ is exchangeable, which is non-trivial to verify since the anti-voter chain is not necessarily reversible. However, we can apply the following lemma; the proof here appears in [45].

Lemma 3.9. If $W$ and $W'$ are identically distributed integer-valued random variables defined on the same space such that $\mathbb{P}(|W' - W| \le 1) = 1$, then $(W, W')$ is an exchangeable pair.

^The term r-regular means that every vertex has degree r.
Proof. The fact that $W$ and $W'$ differ by at most one almost surely implies

\[ \mathbb{P}(W' \le k) = \mathbb{P}(W < k) + \mathbb{P}(W = k, W' \le k) + \mathbb{P}(W = k+1, W' = k), \]

while we also have

\[ \mathbb{P}(W \le k) = \mathbb{P}(W < k) + \mathbb{P}(W = k, W' \le k) + \mathbb{P}(W = k, W' = k+1). \]

Since $W$ and $W'$ have the same distribution, the left-hand sides of the equations above are equal, and equating the right-hand sides yields

\[ \mathbb{P}(W = k+1, W' = k) = \mathbb{P}(W = k, W' = k+1), \]

which is the lemma. □

Proof of Theorem 3.8. For the proof below let $\sigma := \sigma_n$ so that $\sigma W = \sum_{i=1}^n X_i$. The exchangeability of $(W, W')$ follows by Lemma 3.9 since $\mathbb{P}(\sigma(W' - W)/2 \in \{-1, 0, 1\}) = 1$. To show the linearity condition for the Stein pair, we define some auxiliary quantities related to $\mathbf{X}$. Let $a_1 = a_1(\mathbf{X})$ be the number of edges in $G$ which have a one at each end vertex when labeled by $\mathbf{X}$. Similarly, let $a_{-1}$ be the analogous quantity with negative ones at each end vertex and $a_0$ be the number of edges with a different label at each end vertex. Due to the fact that $G$ is $r$-regular, the number of ones in $\mathbf{X}$ is $(2a_1 + a_0)/r$ and the number of negative ones in $\mathbf{X}$ is $(2a_{-1} + a_0)/r$. Note that these two observations imply

\[ \sigma W = \frac{2}{r}(a_1 - a_{-1}). \tag{3.19} \]

Now, since conditional on $\mathbf{X}$ the event $\sigma W' = \sigma W + 2$ is equal to the event that the chain moves to $\mathbf{X}'$ by choosing a vertex labeled $-1$ and then choosing a neighbor with label $-1$, we have

\[ \mathbb{P}(\sigma(W' - W) = 2 \mid \mathbf{X}) = \frac{2a_{-1}}{nr}, \tag{3.20} \]

and similarly

\[ \mathbb{P}(\sigma(W' - W) = -2 \mid \mathbf{X}) = \frac{2a_1}{nr}. \tag{3.21} \]

Using these last two formulas and (3.19), we obtain

\[ \mathbb{E}[\sigma(W' - W) \mid \mathbf{X}] = \frac{4}{nr}(a_{-1} - a_1) = -\frac{2\sigma W}{n}, \]

as desired.
From this point we will compute the error terms from Theorem 3.7. The first thing to note is that $|W' - W| \le 2/\sigma$ implies

\[ \frac{\mathbb{E}|W' - W|^3}{3a} \le \frac{4n}{3\sigma^3}, \]

which contributes the first part of the error from the theorem. Now, (3.20) and (3.21) imply

\[ \mathbb{E}[(W' - W)^2 \mid \mathbf{X}] = \frac{8}{\sigma^2 nr}(a_{-1} + a_1), \tag{3.22} \]

and since

\begin{align*}
2a_1 + 2a_{-1} + 2a_0 &= \sum_{i=1}^n \sum_{j \in N_i} 1 = nr, \\
2a_1 + 2a_{-1} - 2a_0 &= \sum_{i=1}^n \sum_{j \in N_i} X_i X_j = Q,
\end{align*}

we have

\[ Q = 4(a_{-1} + a_1) - rn, \]

which combined with (3.22) and a small calculation yields the second error term of the theorem. □

In order for Theorem 3.8 to be useful for a given graph $G$, we need lower bounds on $\sigma_n^2$ and upper bounds on $\operatorname{Var}(Q)$. The former item can be accomplished by the following result of [1] (Chapter 14).

Lemma 3.10. [1] Let $G$ be an $r$-regular graph and let $\kappa = \kappa(G)$ be the minimum over subsets of vertices $A$ of the number of edges that have both ends in $A$ or both ends in $A^c$. If $\sigma^2$ is the variance of the stationary distribution of the anti-voter model on $G$, then

\[ \frac{2\kappa}{r} \le \sigma^2 \le n. \]

The strategy to upper bound $\operatorname{Var}(Q)$ is to associate the anti-voter model to a so-called "dual process" from interacting particle system theory. This discussion is outside the scope of our work, but see [1, 27, 44].
3.4 Size-bias coupling

Our next method of rewriting $\mathbb{E}[Wf(W)]$ to be compared to $\mathbb{E}[f'(W)]$ is through the size-bias coupling, which first appeared in the context of Stein's method for normal approximation
in 



Definition 3.7. For a random variable $X \ge 0$ with $\mathbb{E}[X] = \mu < \infty$, we say the random variable $X^s$ has the size-bias distribution with respect to $X$ if for all $f$ such that $\mathbb{E}|Xf(X)| < \infty$ we have

\[ \mathbb{E}[Xf(X)] = \mu\,\mathbb{E}[f(X^s)]. \]

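Before discussing existence of the size-bias distribution, we remark that our use of $X^s$ is a bit more transparent than the exchangeable pair. To wit, if $\operatorname{Var}(X) = \sigma^2 < \infty$ and $W = (X - \mu)/\sigma$, then

\[ \mathbb{E}[Wf(W)] = \mathbb{E}\left[\frac{X - \mu}{\sigma} f\left(\frac{X - \mu}{\sigma}\right)\right] = \frac{\mu}{\sigma}\,\mathbb{E}\left[f\left(\frac{X^s - \mu}{\sigma}\right) - f\left(\frac{X - \mu}{\sigma}\right)\right], \tag{3.23} \]

so that if $f$ is differentiable, then the Taylor expansion of (3.23) about $W$ allows us to compare $\mathbb{E}[Wf(W)]$ to $\mathbb{E}[f'(W)]$. We will make this precise shortly, but first we tie up a loose end.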

Proposition 3.11. If $X \ge 0$ is a random variable with $\mathbb{E}[X] = \mu < \infty$ and distribution function $F$, then the size-bias distribution of $X$ is absolutely continuous with respect to the measure of $X$ with density read from

\[ dF^s(x) = \frac{x}{\mu}\,dF(x). \]

Corollary 3.12. If $X \ge 0$ is an integer-valued random variable with $\mathbb{E}[X] = \mu < \infty$, then the random variable $X^s$ with the size-bias distribution of $X$ is such that

\[ \mathbb{P}(X^s = k) = \frac{k\,\mathbb{P}(X = k)}{\mu}. \]
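For integer-valued variables the formula of Corollary 3.12 is easy to compute with directly. The sketch below is illustrative only: the helper name `size_bias_pmf` and the example distributions are assumptions, and exact rational arithmetic is used so that the defining identity $\mathbb{E}[Xf(X)] = \mu\,\mathbb{E}[f(X^s)]$ can be checked without rounding.

```python
from fractions import Fraction

def size_bias_pmf(pmf):
    """Given {k: P(X = k)} for a non-negative integer-valued X with
    0 < E[X] < infinity, return {k: P(X^s = k)} = {k: k P(X = k) / E[X]}."""
    mean = sum(k * p for k, p in pmf.items())
    return {k: k * p / mean for k, p in pmf.items() if k > 0}

# A Bernoulli(p) variable size-biases to the constant 1.
p = Fraction(1, 3)
assert size_bias_pmf({0: 1 - p, 1: p}) == {1: Fraction(1)}

# Check the defining identity E[X f(X)] = mu E[f(X^s)] for one test function.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
biased = size_bias_pmf(pmf)
mu = sum(k * q for k, q in pmf.items())
f = lambda k: k * k + 1
lhs = sum(k * f(k) * q for k, q in pmf.items())
rhs = mu * sum(f(k) * q for k, q in biased.items())
assert lhs == rhs
```

Note that the atom at zero always disappears after size-biasing, which is why indicators size-bias to the constant one.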



The size-bias distribution arises in other contexts such as the waiting time paradox and 
sampling theory [3]. We now record our main Stein's method size-bias normal approximation 
theorem. 

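Theorem 3.13. Let $X \ge 0$ be a random variable with $\mathbb{E}[X] = \mu < \infty$ and $\operatorname{Var}(X) = \sigma^2$. Let $X^s$ be defined on the same space as $X$ and have the size-bias distribution with respect to $X$. If $W = (X - \mu)/\sigma$ and $Z \sim N(0, 1)$, then

\[ d_W(W, Z) \le \frac{\mu}{\sigma^2}\sqrt{\frac{2}{\pi}}\sqrt{\operatorname{Var}\left(\mathbb{E}[X^s - X \mid X]\right)} + \frac{\mu}{\sigma^3}\,\mathbb{E}[(X^s - X)^2]. \]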
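Proof. Our strategy (as usual) is to bound $|\mathbb{E}[f'(W) - Wf(W)]|$ for $f$ bounded with two bounded derivatives. Starting from (3.23), a Taylor expansion yields

\[ \mathbb{E}[Wf(W)] = \frac{\mu}{\sigma}\,\mathbb{E}\left[\frac{X^s - X}{\sigma} f'\left(\frac{X - \mu}{\sigma}\right) + \frac{(X^s - X)^2}{2\sigma^2} f''\left(\frac{X^* - \mu}{\sigma}\right)\right], \]

for some $X^*$ in the interval with endpoints $X$ and $X^s$. Using the definition of $W$ in terms of $X$ in the previous expression, we obtain

\begin{align}
\left|\mathbb{E}[f'(W) - Wf(W)]\right| &\le \left|\mathbb{E}\left[f'(W)\left(1 - \frac{\mu}{\sigma^2}(X^s - X)\right)\right]\right| \tag{3.24} \\
&\quad + \frac{\mu}{2\sigma^3}\left|\mathbb{E}\left[f''\left(\frac{X^* - \mu}{\sigma}\right)(X^s - X)^2\right]\right|. \tag{3.25}
\end{align}

Since we are taking the supremum over functions $f$ with $\|f'\| \le \sqrt{2/\pi}$ and $\|f''\| \le 2$, it is clear that (3.25) is bounded above by the second term of the error stated in the theorem and (3.24) is bounded above by

\[ \sqrt{\frac{2}{\pi}}\,\mathbb{E}\left|1 - \frac{\mu}{\sigma^2}\,\mathbb{E}[X^s - X \mid X]\right| \le \frac{\mu}{\sigma^2}\sqrt{\frac{2}{\pi}}\sqrt{\operatorname{Var}\left(\mathbb{E}[X^s - X \mid X]\right)}; \]

here we use the Cauchy-Schwarz inequality after noting that by the definition of $X^s$, $\mathbb{E}[X^s] = (\sigma^2 + \mu^2)/\mu$. □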



3.4.1 Coupling construction 

At this point it is appropriate to discuss methods to couple a random variable $X$ to a size-bias version $X^s$. In the case that $X = \sum_{i=1}^n X_i$, where $X_i \ge 0$ and $\mathbb{E}[X_i] = \mu_i$, we have the following recipe for constructing a size-bias version of $X$.

1. For each $i = 1, \ldots, n$, let $X_i^s$ have the size-bias distribution of $X_i$ independent of $(X_j)_{j \ne i}$ and $(X_j^s)_{j \ne i}$. Given $X_i^s = x$, define the vector $(X_j^{(i)})_{j \ne i}$ to have the distribution of $(X_j)_{j \ne i}$ conditional on $X_i = x$.

2. Choose a random summand $X_I^s$, where the index $I$ is chosen proportional to $\mu_i$ and independent of all else. Specifically, $\mathbb{P}(I = i) = \mu_i/\mu$, where $\mu = \mathbb{E}[X]$.

3. Define $X^s = \sum_{j \ne I} X_j^{(I)} + X_I^s$.

Proposition 3.14. Let $X = \sum_{i=1}^n X_i$, with $X_i \ge 0$, $\mathbb{E}[X_i] = \mu_i$, and $\mu = \mathbb{E}[X] = \sum_i \mu_i$. If $X^s$ is constructed by Items 1-3 above, then $X^s$ has the size-bias distribution of $X$.
Proof. Let $\mathbf{X} = (X_1, \ldots, X_n)$ and for $i = 1, \ldots, n$, let $\mathbf{X}^{(i)}$ be a vector with coordinate $j$ equal to $X_j^{(i)}$ for $j \ne i$ and coordinate $i$ equal to $X_i^s$ as in Item 1 above. In order to prove the result, it is enough to show

\[ \mathbb{E}[Xf(\mathbf{X})] = \mu\,\mathbb{E}[f(\mathbf{X}^{(I)})], \tag{3.26} \]

for $f : \mathbb{R}^n \to \mathbb{R}$ such that $\mathbb{E}|Xf(\mathbf{X})| < \infty$, where $X = \sum_i X_i$. Equation (3.26) follows easily after we show that for all $i = 1, \ldots, n$,

\[ \mathbb{E}[X_i f(\mathbf{X})] = \mu_i\,\mathbb{E}[f(\mathbf{X}^{(i)})]. \tag{3.27} \]

To see (3.27), note that for $h(X_i) = \mathbb{E}[f(\mathbf{X}) \mid X_i]$,

\[ \mathbb{E}[X_i f(\mathbf{X})] = \mathbb{E}[X_i h(X_i)] = \mu_i\,\mathbb{E}[h(X_i^s)], \]

which is the right-hand side of (3.27). □
Note the following special cases of Proposition 3.14. 

Corollary 3.15. Let $X_1, \ldots, X_n$ be non-negative independent random variables with $\mathbb{E}[X_i] = \mu_i$, and for each $i = 1, \ldots, n$, let $X_i^s$ have the size-bias distribution of $X_i$ independent of $(X_j)_{j \ne i}$ and $(X_j^s)_{j \ne i}$. If $X = \sum_{i=1}^n X_i$, $\mu = \mathbb{E}[X]$, and $I$ is chosen independent of all else with $\mathbb{P}(I = i) = \mu_i/\mu$, then $X^s = X - X_I + X_I^s$ has the size-bias distribution of $X$.

Corollary 3.16. Let $X_1, \ldots, X_n$ be zero-one random variables with $\mathbb{P}(X_i = 1) = p_i$. For each $i = 1, \ldots, n$, let $(X_j^{(i)})_{j \ne i}$ have the distribution of $(X_j)_{j \ne i}$ conditional on $X_i = 1$. If $X = \sum_{i=1}^n X_i$, $\mu = \mathbb{E}[X]$, and $I$ is chosen independent of all else with $\mathbb{P}(I = i) = p_i/\mu$, then $X^s = \sum_{j \ne I} X_j^{(I)} + 1$ has the size-bias distribution of $X$.

Proof. Corollary 3.15 is obvious since, due to independence, the conditioning in the construction has no effect. Corollary 3.16 follows after noting that for $X_i$ a zero-one random variable, $X_i^s = 1$. □
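Corollary 3.15 can be confirmed by exact enumeration for small discrete summands. The sketch below is an illustration under assumptions: the three summand distributions are arbitrary choices, and the helper names (`convolve`, `size_bias`, `coupled_pmf`) are hypothetical. It computes the law of $X - X_I + X_I^s$ by mixing over the index $I$ and compares it with the size-bias law of $X$ obtained from Corollary 3.12.

```python
from fractions import Fraction

def convolve(pmfs):
    """pmf of a sum of independent integer-valued variables."""
    total = {0: Fraction(1)}
    for pmf in pmfs:
        new = {}
        for s, ps in total.items():
            for k, pk in pmf.items():
                new[s + k] = new.get(s + k, Fraction(0)) + ps * pk
        total = new
    return total

def mean(pmf):
    return sum(k * p for k, p in pmf.items())

def size_bias(pmf):
    """P(X^s = k) = k P(X = k) / E[X], as in Corollary 3.12."""
    m = mean(pmf)
    return {k: k * p / m for k, p in pmf.items() if k > 0}

def coupled_pmf(pmfs):
    """Law of X - X_I + X_I^s with P(I = i) = mu_i / mu and X_i^s independent
    of the other summands (the construction of Corollary 3.15)."""
    mu = sum(mean(p) for p in pmfs)
    out = {}
    for i, pmf_i in enumerate(pmfs):
        rest = convolve(pmfs[:i] + pmfs[i + 1:])   # law of the sum of X_j, j != i
        weight = mean(pmf_i) / mu                  # P(I = i)
        for r, pr in rest.items():
            for s, ps in size_bias(pmf_i).items():
                out[r + s] = out.get(r + s, Fraction(0)) + weight * pr * ps
    return out

# Illustrative non-negative summands (assumed for the example).
pmfs = [
    {0: Fraction(1, 2), 1: Fraction(1, 2)},
    {0: Fraction(1, 3), 2: Fraction(2, 3)},
    {1: Fraction(1, 4), 3: Fraction(3, 4)},
]
assert coupled_pmf(pmfs) == size_bias(convolve(pmfs))
```

The comparison is exact because all probabilities are rational; for dependent summands the conditioning in Item 1 of the recipe would have to be implemented as well, which this sketch does not attempt.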



3.4.2 Applications 

Example 3.8. We can use Corollary 3.15 in Theorem 3.13 to bound the Wasserstein distance 
between the normalized sum of independent variables with finite third moment and the 
normal distribution - we leave this as an exercise. 

Example 3.9. Let $G = G(n, p)$ be an Erdős-Rényi graph and for $i = 1, \ldots, n$, let $X_i$ be the indicator that vertex $v_i$ (under some arbitrary but fixed labeling) has degree zero, so that $X = \sum_{i=1}^n X_i$ is the number of isolated vertices of $G$. We will use Theorem 3.13 to obtain an upper bound on the Wasserstein metric between the normal distribution and the distribution of $W = (X - \mu)/\sigma$, where $\mu = \mathbb{E}[X]$ and $\sigma^2 = \operatorname{Var}(X)$.

Since $X$ is a sum of identically distributed indicators, we can use Corollary 3.16 to construct $X^s$, a size-bias version of $X$. Corollary 3.16 states that in order to size-bias $X$, we first choose an index $I$ uniformly at random from the set $\{1, \ldots, n\}$, then size-bias $X_I$ by setting it equal to one, and finally adjust the remaining summands conditional on $X_I = 1$ (the new size-bias value). We can realize $X_I^s = 1$ by erasing any edges connected to vertex $v_I$. Given that $X_I = 1$ ($v_I$ is isolated), the graph $G$ is just an Erdős-Rényi graph on the remaining $n - 1$ vertices. Thus $X^s$ can be realized as the number of isolated vertices in $G$ after erasing all the edges connected to $v_I$.

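In order to apply Theorem 3.13 using this construction, we need to compute $\mathbb{E}[X]$, $\operatorname{Var}(X)$, $\operatorname{Var}(\mathbb{E}[X^s - X \mid X])$, and $\mathbb{E}[(X^s - X)^2]$. Since the chance that a given vertex is isolated is $(1 - p)^{n-1}$, we have

\[ \mu := \mathbb{E}[X] = n(1 - p)^{n-1}, \]

and also that

\begin{align}
\sigma^2 := \operatorname{Var}(X) &= \mu\left(1 - (1 - p)^{n-1}\right) + n(n - 1)\operatorname{Cov}(X_1, X_2) \notag \\
&= \mu\left[1 + (np - 1)(1 - p)^{n-2}\right], \tag{3.28}
\end{align}

since $\mathbb{E}[X_1 X_2] = (1 - p)^{2n-3}$. Let $d_i$ be the degree of $v_i$ in $G$ and let $D_i$ be the number of vertices connected to $v_i$ which have degree one. Then it is clear that

\[ X^s - X = D_I + \mathbb{I}[d_I > 0], \]

so that

\begin{align}
\operatorname{Var}\left(\mathbb{E}[X^s - X \mid G]\right) &= \frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^n \left(D_i + \mathbb{I}[d_i > 0]\right)\right) \tag{3.29} \\
&\le \frac{2}{n^2}\left[\operatorname{Var}\left(\sum_{i=1}^n D_i\right) + \operatorname{Var}\left(\sum_{i=1}^n \mathbb{I}[d_i > 0]\right)\right]. \tag{3.30}
\end{align}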

Since $\sum_{i=1}^n \mathbb{I}[d_i > 0] = n - X$, the second variance term of (3.30) is given by (3.28). Now, $\sum_{i=1}^n D_i$ is the number of vertices in $G$ with degree one, which can be expressed as $\sum_{i=1}^n Y_i$, where $Y_i$ is the indicator that $v_i$ has degree one in $G$. Thus,

\begin{align*}
\operatorname{Var}\left(\sum_{i=1}^n D_i\right) &= n(n-1)p(1-p)^{n-2}\left(1 - (n-1)p(1-p)^{n-2}\right) + n(n-1)\operatorname{Cov}(Y_1, Y_2) \\
&= n(n-1)p(1-p)^{n-2}\left[1 - (n-1)p(1-p)^{n-2} + (1-p)^{n-2} + (n-2)^2 p(1-p)^{n-3} - (n-1)^2 p(1-p)^{n-2}\right],
\end{align*}

since $\mathbb{E}[Y_1 Y_2] = p(1-p)^{2n-4} + (n-2)^2 p^2 (1-p)^{2n-5}$ (the first term corresponds to $v_1$ and $v_2$ being joined; in the second term each of $v_1$ and $v_2$ is joined to exactly one of the remaining $n - 2$ vertices).

The final term we need to bound is

\begin{align*}
\mathbb{E}[(X^s - X)^2] &= \mathbb{E}\left[\mathbb{E}[(X^s - X)^2 \mid X]\right] \\
&= \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[(D_i + \mathbb{I}[d_i > 0])^2\right] \\
&\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[(D_i + 1)^2\right] \\
&= \mathbb{E}[D_1^2] + 2\,\mathbb{E}[D_1] + 1.
\end{align*}

Expressing $D_1$ as a sum of indicators, it is not difficult to show

\[ \mathbb{E}[D_1^2] = (n-1)p(1-p)^{n-2} + (n-1)(n-2)p^2(1-p)^{2n-5}, \]

and after noting that $D_1 \le D_1^2$ almost surely, we can combine the estimates above with Theorem 3.13 to obtain an explicit upper bound between the distribution of $W$ and the standard normal in the Wasserstein metric. In particular, we can read the following result from our work above.

Theorem 3.17. If $X$ is the number of isolated vertices in an Erdős-Rényi graph $G(n, p)$, $W = (X - \mu)/\sigma$, and for some $1 \le \alpha < 2$ we have $\lim_{n \to \infty} n^\alpha p = c \in (0, \infty)$, then

\[ d_W(W, Z) \le \frac{C}{\sigma}, \]

for some constant $C$.

Proof. The asymptotic hypothesis $\lim_{n \to \infty} n^\alpha p = c \in (0, \infty)$ for some $1 \le \alpha < 2$ implies that $(1 - p)^n$ tends to a finite positive constant. Thus we can see that $\mu \asymp n$, $\sigma^2 \asymp n^{2-\alpha}$, $\operatorname{Var}(\mathbb{E}[X^s - X \mid X]) \asymp \sigma^2/n^2$, and $\mathbb{E}[(X^s - X)^2] \asymp n^{1-\alpha}$, from which the result follows from Theorem 3.13. □
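The closed forms for the mean and variance of the number of isolated vertices, $\mu = n(1-p)^{n-1}$ and $\sigma^2 = \mu[1 + (np - 1)(1-p)^{n-2}]$ from (3.28), can be sanity-checked by brute force on a tiny graph. The following sketch is an illustration only: the function name `isolated_vertex_moments` is hypothetical, and the enumeration over all $2^{\binom{n}{2}}$ graphs is feasible only for very small $n$.

```python
from fractions import Fraction
from itertools import combinations, product

def isolated_vertex_moments(n, p):
    """Exact mean and variance of the number of isolated vertices of G(n, p),
    obtained by enumerating every graph on n labeled vertices."""
    edges = list(combinations(range(n), 2))
    first = second = Fraction(0)
    for present in product([0, 1], repeat=len(edges)):
        prob = Fraction(1)
        degree = [0] * n
        for (u, v), bit in zip(edges, present):
            prob *= p if bit else 1 - p     # edges are present independently
            if bit:
                degree[u] += 1
                degree[v] += 1
        x = sum(1 for d in degree if d == 0)  # number of isolated vertices
        first += prob * x
        second += prob * x * x
    return first, second - first ** 2

n, p = 4, Fraction(1, 3)
mu, var = isolated_vertex_moments(n, p)
assert mu == n * (1 - p) ** (n - 1)
assert var == mu * (1 + (n * p - 1) * (1 - p) ** (n - 2))
```

With rational $p$ both identities hold exactly, so the assertions test the formulas themselves rather than a numerical approximation.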

Example 3.9 can be generalized to counts of vertices of a given degree $d$ at some computational expense [31, 33]; related results pertain to subgraph counts in an Erdős-Rényi graph (such as the number of triangles) [31]. We will examine such constructions in greater detail in our treatment of Stein's method for Poisson approximation where the size-bias coupling will play a large role.



3.5 Zero-bias coupling 

Our next method of rewriting $\mathbb{E}[Wf(W)]$ to be compared to $\mathbb{E}[f'(W)]$ is through the zero-bias coupling first introduced in [32].

Definition 3.10. For a random variable $W$ with $\mathbb{E}[W] = 0$ and $\operatorname{Var}(W) = \sigma^2 < \infty$, we say the random variable $W^z$ has the zero-bias distribution with respect to $W$ if for all absolutely continuous $f$ such that $\mathbb{E}|Wf(W)| < \infty$ we have

\[ \mathbb{E}[Wf(W)] = \sigma^2\,\mathbb{E}[f'(W^z)]. \]

Before discussing existence and properties of the zero-bias distribution, we note that it is appropriate to view zero-biasing as a distributional transform which has the normal distribution as its unique fixed point. Also note that zero-biasing is our most transparent effort to compare $\mathbb{E}[Wf(W)]$ to $\mathbb{E}[f'(W)]$, culminating in the following result.

Theorem 3.18. Let $W$ be a mean zero, variance one random variable and let $W^z$ be defined on the same space as $W$ and have the zero-bias distribution with respect to $W$. If $Z \sim N(0, 1)$, then

\[ d_W(W, Z) \le 2\,\mathbb{E}|W^z - W|. \]

Proof. Let $\mathcal{F}$ be the set of functions such that $\|f'\| \le \sqrt{2/\pi}$ and $\|f\|, \|f''\| \le 2$. Then

\begin{align*}
d_W(W, Z) &\le \sup_{f \in \mathcal{F}} \left|\mathbb{E}[f'(W) - Wf(W)]\right| \\
&= \sup_{f \in \mathcal{F}} \left|\mathbb{E}[f'(W) - f'(W^z)]\right| \\
&\le \sup_{f \in \mathcal{F}} \|f''\|\,\mathbb{E}|W - W^z|.
\end{align*}

□

Before proceeding further, we discuss some fundamental properties of the zero-bias dis- 
tribution. 

Proposition 3.19. Let $W$ be a random variable with $\mathbb{E}[W] = 0$ and $\operatorname{Var}(W) = \sigma^2 < \infty$.

1. There is a unique probability distribution for a random variable $W^z$ satisfying

\[ \mathbb{E}[Wf(W)] = \sigma^2\,\mathbb{E}[f'(W^z)] \tag{3.31} \]

for all absolutely continuous $f$ such that $\mathbb{E}|Wf(W)| < \infty$.

2. The distribution of $W^z$ as defined by (3.31) is absolutely continuous with respect to Lebesgue measure with density

\[ p^z(w) = \sigma^{-2}\,\mathbb{E}\left[W\,\mathbb{I}[W > w]\right] = -\sigma^{-2}\,\mathbb{E}\left[W\,\mathbb{I}[W \le w]\right]. \tag{3.32} \]
Proof. Assume that $\sigma^2 = 1$; the proof for general $\sigma$ is similar. We will show Items 1 and 2 simultaneously by showing that $p^z$ defined by (3.32) is a probability density which defines a distribution satisfying (3.31).

Let $f(x) = \int_0^x g(t)\,dt$ for a non-negative function $g$ integrable on compact domains. Then

\begin{align*}
\int_0^\infty f'(u)\,\mathbb{E}\left[W\,\mathbb{I}[W > u]\right]du &= \int_0^\infty g(u)\,\mathbb{E}\left[W\,\mathbb{I}[W > u]\right]du \\
&= \mathbb{E}\left[W \int_0^{\max\{0, W\}} g(u)\,du\right] = \mathbb{E}\left[Wf(W)\,\mathbb{I}[W \ge 0]\right],
\end{align*}

and similarly $\int_{-\infty}^0 f'(u)p^z(u)\,du = \mathbb{E}[Wf(W)\,\mathbb{I}[W \le 0]]$, which implies that

\[ \int_{\mathbb{R}} f'(u)p^z(u)\,du = \mathbb{E}[Wf(W)] \tag{3.33} \]

for all $f$ as above. However, (3.33) extends to all absolutely continuous $f$ such that $\mathbb{E}|Wf(W)| < \infty$ by routine analytic considerations (e.g. considering the positive and negative part of $f$).

We now show that $p^z$ is a probability density. That $p^z$ is non-negative follows by considering the two representations in (3.32); note that these representations are equal since $\mathbb{E}[W] = 0$. We also have

\[ \int_0^\infty p^z(u)\,du = \mathbb{E}\left[W^2\,\mathbb{I}[W > 0]\right] \quad \text{and} \quad \int_{-\infty}^0 p^z(u)\,du = \mathbb{E}\left[W^2\,\mathbb{I}[W \le 0]\right], \]

so that $\int_{\mathbb{R}} p^z(u)\,du = \mathbb{E}[W^2] = 1$.

Finally, uniqueness follows since if random variables $X$ and $Y$ satisfy $\mathbb{E}[f'(X)] = \mathbb{E}[f'(Y)]$ for all continuously differentiable $f$ with compact support (say), then $X$ and $Y$ have the same distribution. □
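The density formula (3.32) is straightforward to evaluate for a discrete $W$. As a hedged illustration (the function name `zero_bias_density` and the choice of a Rademacher variable are assumptions made for this example), the sketch below evaluates $p^z$ for $W = \pm 1$ with probability $1/2$ each and confirms that the zero-bias distribution is uniform on $[-1, 1]$.

```python
from fractions import Fraction

def zero_bias_density(atoms, w):
    """p^z(w) = E[W 1{W > w}] / Var(W) for a mean-zero discrete W.

    `atoms` is a list of (value, probability) pairs with mean zero."""
    variance = sum(p * v * v for v, p in atoms)
    return sum(p * v for v, p in atoms if v > w) / variance

half = Fraction(1, 2)
rademacher = [(Fraction(-1), half), (Fraction(1), half)]

# The density equals 1/2 on [-1, 1) and vanishes elsewhere.
assert zero_bias_density(rademacher, Fraction(3, 10)) == half
assert zero_bias_density(rademacher, Fraction(3, 2)) == 0
assert zero_bias_density(rademacher, Fraction(-2)) == 0

# Midpoint rule on [-2, 2]; exact here because the density is piecewise constant.
step = Fraction(1, 100)
mass = sum(
    zero_bias_density(rademacher, -2 + (j + half) * step) * step
    for j in range(400)
)
assert mass == 1
```

The vanishing of the density to the left of $-1$ uses $\mathbb{E}[W] = 0$, which is the same observation that makes the two representations in (3.32) agree.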

The next result shows that little generality is lost in only considering $W$ with $\operatorname{Var}(W) = 1$ as we have done in Theorem 3.18. The result can be read from the density formula above or by a direct computation.

Proposition 3.20. If $W$ has mean zero and finite variance then $(aW)^z \stackrel{d}{=} aW^z$.



3.5.1 Coupling construction 

How do we construct a zero-bias coupling for a random variable? In general this can be 
difficult, but we now discuss the nicest case of a sum of independent random variables and 
work out a neat theoretical application using the construction. Another canonical method 
of construction that is useful in practice can be derived from a Stein pair - see [32]. 

Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$, $\operatorname{Var}(X_i) = \sigma_i^2$, $\sum_{i=1}^n \sigma_i^2 = 1$, and define $W = \sum_{i=1}^n X_i$. We have the following recipe for constructing a zero-bias version of $W$.

1. For each $i = 1, \ldots, n$, let $X_i^z$ have the zero-bias distribution of $X_i$ independent of $(X_j)_{j \ne i}$ and $(X_j^z)_{j \ne i}$.

2. Choose a random summand $X_I$, where the index $I$ satisfies $\mathbb{P}(I = i) = \sigma_i^2$ and is independent of all else.

3. Define $W^z = \sum_{j \ne I} X_j + X_I^z$.

Proposition 3.21. Let $W = \sum_{i=1}^n X_i$ be defined as above. If $W^z$ is constructed as per Items 1-3 above, then $W^z$ has the zero-bias distribution of $W$.

Proof. We must show that $\mathbb{E}[Wf(W)] = \mathbb{E}[f'(W^z)]$ for all appropriate $f$. Using the definition of zero-biasing in the coordinate $X_i$ and the fact that $W - X_i$ is independent of $X_i$, we have

\begin{align*}
\mathbb{E}[Wf(W)] &= \sum_{i=1}^n \mathbb{E}\left[X_i f(W - X_i + X_i)\right] \\
&= \sum_{i=1}^n \sigma_i^2\,\mathbb{E}\left[f'(W - X_i + X_i^z)\right] \\
&= \mathbb{E}\left[f'(W - X_I + X_I^z)\right].
\end{align*}

Since $\sum_{j \ne I} X_j + X_I^z = W - X_I + X_I^z$, the proof is complete. □

3.5.2 Lindeberg-Feller condition 

We now discuss the way in which zero-biasing appears naturally in the proof of the Lindeberg- 
Feller CLT. Our treatment closely follows [30]. 

Let $(X_{i,n})_{n \ge 1, 1 \le i \le n}$ be a triangular array of random variables^ such that $\operatorname{Var}(X_{i,n}) = \sigma_{i,n}^2 < \infty$. Let $W_n = \sum_{i=1}^n X_{i,n}$, and assume that $\operatorname{Var}(W_n) = 1$. A sufficient condition for $W_n$ to satisfy a CLT as $n \to \infty$ is the Lindeberg condition: for all $\varepsilon > 0$,

\[ \sum_{i=1}^n \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| > \varepsilon]\right] \to 0, \quad \text{as } n \to \infty. \tag{3.34} \]

The condition ensures that no single term dominates in the sum so that the limit is not altered by the distribution of a summand. Note that the condition is not necessary, as we could take $X_{1,n}$ to be standard normal and the rest of the terms zero. We now have the following result.

^That is, for each $n$, $(X_{i,n})_{1 \le i \le n}$ is a collection of independent random variables.
Theorem 3.22. Let $(X_{i,n})_{n \ge 1, 1 \le i \le n}$ be the triangular array defined above and let $I_n$ be a random variable independent of the $X_{i,n}$ and such that $\mathbb{P}(I_n = i) = \sigma_{i,n}^2$. For each $1 \le i \le n$, let $X_{i,n}^z$ have the zero-bias distribution of $X_{i,n}$ independent of all else. Then the Lindeberg condition (3.34) holds if and only if

\[ X_{I_n,n}^z \stackrel{p}{\to} 0 \quad \text{as } n \to \infty. \tag{3.35} \]

From this point, we can use a modification of Theorem 3.18 to prove the following result, which also follows from Theorem 3.22 and the classical Lindeberg-Feller CLT mentioned above.

Theorem 3.23. In the notation of Theorem 3.22 and the remarks directly preceding it, if $X_{I_n,n}^z \to 0$ in probability as $n \to \infty$, then $W_n$ satisfies a CLT.

Before proving these two results, we note that Theorem 3.23 is heuristically explained by Theorem 3.18 and the zero-bias construction of $W_n$. Specifically, $|W_n^z - W_n| = |X_{I_n,n}^z - X_{I_n,n}|$ and Theorem 3.18 implies that $W_n$ is approximately normal if this latter quantity is small (in expectation). The proof of Theorem 3.23 uses a modification of the error in Theorem 3.18 and the (non-trivial) fact that $X_{I_n,n}^z \to 0$ in probability implies that $X_{I_n,n} \to 0$ in probability. Finally, the quantity $|X_{i,n}^z - X_{i,n}|$ will also be small if $X_{i,n}$ is approximately normal, which indicates that the zero-bias approach will show convergence in the CLT for the sum of independent random variables when such a result holds.

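Proof of Theorem 3.22. We first perform a preliminary calculation to relate the Lindeberg-Feller condition to the zero-bias quantity of interest. For some fixed $\varepsilon > 0$, let $f'(x) = \mathbb{I}[|x| \ge \varepsilon]$ and $f(0) = 0$. Using that $xf(x) = (x^2 - \varepsilon|x|)\,\mathbb{I}[|x| \ge \varepsilon]$ and the definition of the zero-bias transform, we find

\begin{align*}
\mathbb{P}(|X_{I_n,n}^z| \ge \varepsilon) &= \sum_{i=1}^n \sigma_{i,n}^2\,\mathbb{P}(|X_{i,n}^z| \ge \varepsilon) \\
&= \sum_{i=1}^n \mathbb{E}\left[X_{i,n} f(X_{i,n})\right] \\
&= \sum_{i=1}^n \mathbb{E}\left[(X_{i,n}^2 - \varepsilon|X_{i,n}|)\,\mathbb{I}[|X_{i,n}| \ge \varepsilon]\right].
\end{align*}

From this point we note

\[ \frac{x^2}{2}\,\mathbb{I}[|x| \ge 2\varepsilon] \le (x^2 - \varepsilon|x|)\,\mathbb{I}[|x| \ge \varepsilon] \le x^2\,\mathbb{I}[|x| \ge \varepsilon], \]

which implies that for all $\varepsilon > 0$,

\[ \frac{1}{2}\sum_{i=1}^n \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| \ge 2\varepsilon]\right] \le \mathbb{P}(|X_{I_n,n}^z| \ge \varepsilon) \le \sum_{i=1}^n \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| \ge \varepsilon]\right], \]

so that (3.34) and (3.35) are equivalent. □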

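Proof of Theorem 3.23. According to the proof of Theorem 3.18, it is enough to show that

\[ \left|\mathbb{E}[f'(W_n) - f'(W_n^z)]\right| \to 0 \quad \text{as } n \to \infty \tag{3.36} \]

for all bounded $f$ with two bounded derivatives. We will show that $|W_n^z - W_n| \to 0$ in probability, which implies (3.36) by the following calculation.

\begin{align*}
\left|\mathbb{E}[f'(W_n) - f'(W_n^z)]\right| &\le \int_0^\infty \mathbb{P}\left(|f'(W_n) - f'(W_n^z)| \ge t\right)dt \\
&= \int_0^{2\|f'\|} \mathbb{P}\left(|f'(W_n) - f'(W_n^z)| \ge t\right)dt \\
&\le \int_0^{2\|f'\|} \mathbb{P}\left(\|f''\|\,|W_n - W_n^z| \ge t\right)dt \\
&= \int_0^{2\|f'\|} \mathbb{P}\left(|W_n - W_n^z| \ge t/\|f''\|\right)dt,
\end{align*}

which tends to zero by dominated convergence.

We now must show that $|W_n^z - W_n| \to 0$ in probability. Since we are assuming that $X_{I_n,n}^z \to 0$ in probability, and $|W_n^z - W_n| = |X_{I_n,n}^z - X_{I_n,n}|$, it is enough to show that $X_{I_n,n} \to 0$ in probability. For $\varepsilon > 0$, and $m_n := \max_{1 \le i \le n} \sigma_{i,n}^2$,

\[ \mathbb{P}(|X_{I_n,n}| \ge \varepsilon) \le \frac{\operatorname{Var}(X_{I_n,n})}{\varepsilon^2} = \frac{1}{\varepsilon^2}\sum_{i=1}^n \sigma_{i,n}^4 \le \frac{m_n}{\varepsilon^2}\sum_{i=1}^n \sigma_{i,n}^2 = \frac{m_n}{\varepsilon^2}. \]

From this point we show $m_n \to 0$, which will complete the proof. For any $\delta > 0$, we have

\begin{align}
\sigma_{i,n}^2 &= \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| \le \delta]\right] + \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| > \delta]\right] \notag \\
&\le \delta^2 + \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| > \delta]\right]. \tag{3.37}
\end{align}

Using the calculations in the proof of Theorem 3.22 based on the assumption that $X_{I_n,n}^z \to 0$ in probability, it follows that

\[ \sum_{i=1}^n \mathbb{E}\left[X_{i,n}^2\,\mathbb{I}[|X_{i,n}| > \delta]\right] \to 0 \quad \text{as } n \to \infty, \]

so that the second term of (3.37) goes to zero as $n$ goes to infinity uniformly in $i$. Thus we have that $\limsup_n m_n \le \delta^2$ for all $\delta > 0$, which implies that $m_n \to 0$ since $m_n > 0$. □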
3.6 Normal approximation in the Kolmogorov metric

Our previous work has been to develop bounds on the Wasserstein metric between a distribution of interest and the normal distribution. For $W$ a random variable and $Z$ standard normal, we have the inequality

\[ d_K(W, Z) \le (2/\pi)^{1/4}\sqrt{d_W(W, Z)}, \]

so that our previous effort implies bounds for the Kolmogorov metric. However, it is often the case that this inequality is suboptimal: for example, if $W$ is a standardized binomial random variable with parameters $n$ and $p$, then both $d_K(W, Z)$ and $d_W(W, Z)$ are of order $n^{-1/2}$. In this section we develop Stein's method for normal approximation in the Kolmogorov metric in hopes of reconciling this discrepancy.^ We follow [24] in our exposition below but similar results using related methods appear elsewhere [40, 44, 50].
Recall the following restatement of Corollary 2.3.

Theorem 3.24. Let $\Phi$ denote the standard normal distribution function and let $f_x(w)$ be the unique bounded solution of

\[ f_x'(w) - wf_x(w) = \mathbb{I}[w \le x] - \Phi(x). \tag{3.38} \]

If $W$ is a random variable with finite mean and $Z$ is standard normal, then

\[ d_K(W, Z) = \sup_{x \in \mathbb{R}} \left|\mathbb{E}[f_x'(W) - Wf_x(W)]\right|. \]

Moreover, we have the following lemma, which can be read from [24], Lemma 2.3.

Lemma 3.25. If $f_x$ is the unique bounded solution to (3.38), then

\[ \|f_x\| \le \sqrt{\frac{\pi}{2}}, \qquad \|f_x'\| \le 2, \]

and for all $u, v, w \in \mathbb{R}$,

\[ \left|(w + u)f_x(w + u) - (w + v)f_x(w + v)\right| \le \left(|w| + \sqrt{2\pi}/4\right)(|u| + |v|). \]

Our program can be summed up in the following corollary to the results above.

Corollary 3.26. If $\mathcal{F}$ is the set of functions satisfying the bounds of Lemma 3.25 and $W$ is a random variable with finite mean and $Z$ is standard normal, then

\[ d_K(W, Z) \le \sup_{f \in \mathcal{F}} \left|\mathbb{E}[f'(W) - Wf(W)]\right|. \]

^Of course improved rates will come at the cost of additional hypotheses, but we will see that the theorems are still useful in application.
3.6.1 Zero-bias transformation

To get better rates directly from the zero-bias transform, we must assume a boundedness condition.

Theorem 3.27. Let $W$ be a mean zero, variance one random variable and suppose there is $W^z$ having the zero-bias distribution of $W$ on the same space as $W$ such that $|W^z - W| \le \delta$ almost surely. If $Z$ is standard normal, then

\[ d_K(W, Z) \le \left(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\right)\delta. \]

Proof. Our strategy of proof is to show that the condition $|W^z - W| \le \delta$ implies that $|d_K(W, Z) - d_K(W^z, Z)|$ is bounded by a constant times $\delta$. From this point we will only need to show that $d_K(W^z, Z)$ is of order $\delta$, which is not as difficult due heuristically to the fact that the zero-bias transform is smooth (absolutely continuous with respect to Lebesgue measure).

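We implement the first part of the program. For $z \in \mathbb{R}$,

\begin{align}
\mathbb{P}(W \le z) - \mathbb{P}(Z \le z) &\le \mathbb{P}(W \le z) - \mathbb{P}(Z \le z + \delta) + \mathbb{P}(Z \le z + \delta) - \mathbb{P}(Z \le z) \notag \\
&\le \mathbb{P}(W^z \le z + \delta) - \mathbb{P}(Z \le z + \delta) + \frac{\delta}{\sqrt{2\pi}} \notag \\
&\le d_K(W^z, Z) + \frac{\delta}{\sqrt{2\pi}}, \tag{3.39}
\end{align}

where the second inequality follows since $\{W \le z\} \subseteq \{W^z \le z + \delta\}$ and since $Z$ has density bounded by $(2\pi)^{-1/2}$. Similarly,

\begin{align*}
\mathbb{P}(W \le z) - \mathbb{P}(Z \le z) &= \mathbb{P}(W \le z) - \mathbb{P}(Z \le z - \delta) + \mathbb{P}(Z \le z - \delta) - \mathbb{P}(Z \le z) \\
&\ge \mathbb{P}(W^z \le z - \delta) - \mathbb{P}(Z \le z - \delta) - \frac{\delta}{\sqrt{2\pi}},
\end{align*}

which after taking the supremum over $z$ and combining with (3.39) implies that

\[ \left|d_K(W, Z) - d_K(W^z, Z)\right| \le \frac{\delta}{\sqrt{2\pi}}. \tag{3.40} \]

Now, by Corollary 3.26 (and using the notation there), we have

\[ d_K(W^z, Z) \le \sup_{f \in \mathcal{F}} \left|\mathbb{E}[f'(W^z) - W^z f(W^z)]\right|, \tag{3.41} \]

and for $f \in \mathcal{F}$, we find after using the definition of the zero-bias transform and Lemma 3.25

\begin{align}
\left|\mathbb{E}[f'(W^z) - W^z f(W^z)]\right| &= \left|\mathbb{E}[Wf(W) - W^z f(W^z)]\right| \notag \\
&\le \mathbb{E}\left[\left(|W| + \frac{\sqrt{2\pi}}{4}\right)|W^z - W|\right] \le \delta\left(1 + \frac{\sqrt{2\pi}}{4}\right). \tag{3.42}
\end{align}

Combining (3.40), (3.41), and (3.42) yields the theorem. □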

Theorem 3.27 can be applied to sums of independent random variables which are almost surely bounded (note that $W$ bounded implies $W^z$ bounded), and can also be used to derive a bound in Hoeffding's combinatorial CLT under some boundedness assumption.



3.6.2 Exchangeable pairs 

To get better rates using exchangeable pairs, we again assume a boundedness condition. A 
slightly more general version of this theorem appears in [50]. 

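Theorem 3.28. If $(W, W')$ is an $a$-Stein pair with $\operatorname{Var}(W) = 1$ and such that $|W' - W| \le \delta$, then

\[ d_K(W, Z) \le \frac{\sqrt{\operatorname{Var}\left(\mathbb{E}[(W' - W)^2 \mid W]\right)}}{2a} + \frac{\delta^3}{2a} + \frac{3\delta}{2}. \]

Proof. Let $f_x$ be the bounded solution of (3.38). Exchangeability implies

\[ \mathbb{E}[Wf_x(W)] = \frac{1}{2a}\,\mathbb{E}\left[(W' - W)(f_x(W') - f_x(W))\right], \]

so that we can see

\begin{align}
\mathbb{E}[f_x'(W) - Wf_x(W)] &= \mathbb{E}\left[f_x'(W)\left(1 - \frac{(W' - W)^2}{2a}\right)\right] \tag{3.43} \\
&\quad + \mathbb{E}\left[\frac{W' - W}{2a}\int_0^{W' - W}\left[f_x'(W) - f_x'(W + t)\right]dt\right]. \tag{3.44}
\end{align}

Exactly as in the proof of Theorem 3.7 (the result analogous to Theorem 3.28 but for the Wasserstein metric), the term (3.43) contributes the first error term from the theorem (using the bounds of Lemma 3.25). Now, since $f_x$ satisfies (3.38), we can rewrite (3.44) as

\begin{align}
&\mathbb{E}\left[\frac{W' - W}{2a}\int_0^{W' - W}\left[Wf_x(W) - (W + t)f_x(W + t)\right]dt\right] \tag{3.45} \\
&\quad + \mathbb{E}\left[\frac{W' - W}{2a}\int_0^{W' - W}\left[\mathbb{I}[W \le x] - \mathbb{I}[W + t \le x]\right]dt\right], \tag{3.46}
\end{align}

and we can apply Lemma 3.25 to find that the absolute value of (3.45) is bounded above by

\[ \mathbb{E}\left[\frac{|W' - W|}{2a}\int_0^{|W' - W|}\left(|W| + \frac{\sqrt{2\pi}}{4}\right)|t|\,dt\right] \le \mathbb{E}\left[\frac{|W' - W|^3}{4a}\left(|W| + \frac{\sqrt{2\pi}}{4}\right)\right] \le \frac{\delta^3}{2a}. \]

In order to bound the absolute value of (3.46), we consider separately the cases $W' - W$ positive and negative. For example,

\[ \left|\mathbb{E}\left[\frac{(W' - W)\,\mathbb{I}[W' < W]}{2a}\int_{W' - W}^0 \mathbb{I}[x < W \le x - t]\,dt\right]\right| \le \frac{1}{2a}\,\mathbb{E}\left[(W' - W)^2\,\mathbb{I}[W' < W]\,\mathbb{I}[x < W \le x + \delta]\right], \]

where we have used that $|W' - W| \le \delta$. A similar inequality can be obtained for $W' > W$ and combining these terms implies that the absolute value of (3.46) is bounded above by

\[ \frac{1}{2a}\,\mathbb{E}\left[(W' - W)^2\,\mathbb{I}[x < W \le x + \delta]\right]. \tag{3.47} \]

Lemma 3.29 below shows (3.47) is bounded above by $3\delta/2$, which proves the theorem. □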

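Lemma 3.29. If $(W, W')$ is an $a$-Stein pair with $\operatorname{Var}(W) = 1$ and such that $|W' - W| \le \delta$, then for all $x \in \mathbb{R}$

\[ \mathbb{E}\left[(W' - W)^2\,\mathbb{I}[x < W \le x + \delta]\right] \le 3\delta a. \]

Proof. Let $g'(w) = \mathbb{I}[x - \delta < w \le x + 2\delta]$ and $g(x + \delta/2) = 0$. Using that $\|g\| \le 3\delta/2$ in the first inequality below, we have

\begin{align*}
3\delta a &\ge 2a\,\mathbb{E}[Wg(W)] \\
&= \mathbb{E}\left[(W' - W)(g(W') - g(W))\right] \\
&= \mathbb{E}\left[(W' - W)\int_0^{W' - W} g'(W + t)\,dt\right] \\
&\ge \mathbb{E}\left[(W' - W)\int_0^{W' - W} \mathbb{I}[x - \delta < W + t \le x + 2\delta]\,\mathbb{I}[x < W \le x + \delta]\,dt\right] \\
&= \mathbb{E}\left[(W' - W)^2\,\mathbb{I}[x < W \le x + \delta]\right],
\end{align*}

as desired. □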



Theorem 3.28 can be applied to sums of independent random variables which are al- 
most surely bounded, and can also be applied to the anti-voter model to yield rates in the 
Kolmogorov metric that are comparable to those we obtained in the Wasserstein metric in 
Section 3.3.1. 
4 Poisson Approximation

One great advantage of Stein's method is that it can easily be adapted to various distribu- 
tions and metrics. In this section we develop Stein's method for bounding the total variation 
distance (see Section 1.1.1) between a distribution of interest and the Poisson distribution. 
We will move quickly through the material analogous to that of Section 2 for normal ap- 
proximation, as the general framework is similar. We follow the exposition of [11]. 

Lemma 4.1. For $\lambda > 0$, define the functional operator $\mathcal{A}$ by

\[ \mathcal{A}f(k) = \lambda f(k + 1) - kf(k). \]

1. If the random variable $Z$ has the Poisson distribution with mean $\lambda$, then $\mathbb{E}\mathcal{A}f(Z) = 0$ for all bounded $f$.

2. If for some non-negative integer-valued random variable $W$, $\mathbb{E}\mathcal{A}f(W) = 0$ for all bounded functions $f$, then $W$ has the Poisson distribution with mean $\lambda$.

The operator A is referred to as a characterizing operator of the Poisson distribution. 

Before proving the lemma, we state one more result and then its consequence. 

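Lemma 4.2. Let $\mathbb{P}_\lambda$ denote probability with respect to a Poisson distribution with mean $\lambda$ and $A \subseteq \mathbb{N} \cup \{0\}$. The unique solution $f_A$ of

$$\lambda f_A(k+1) - k f_A(k) = I[k \in A] - \mathbb{P}_\lambda(A) \tag{4.1}$$

with $f_A(0) = 0$ is given by

$$f_A(k) = \lambda^{-k} e^{\lambda} (k-1)! \left[\mathbb{P}_\lambda(A \cap U_k) - \mathbb{P}_\lambda(A)\,\mathbb{P}_\lambda(U_k)\right],$$

where $U_k = \{0, 1, \ldots, k-1\}$.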

Analogous to normal approximation, this setup immediately yields the following promis- 
ing result. 

Corollary 4.3. If $W$ is an integer-valued random variable with mean $\lambda$, then

$$|\mathbb{P}(W \in A) - \mathbb{P}_\lambda(A)| = |\mathbb{E}[\lambda f_A(W+1) - W f_A(W)]|.$$

Proof of Lemma 4.2. The relation (4.1) defines $f_A$ recursively, so it is obvious that the solution is unique under the boundary condition $f_A(0) = 0$. The fact that the solution is as claimed can be easily verified by substitution into the recursion (4.1). □
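The substitution check in the proof of Lemma 4.2 can also be carried out numerically. The following is a minimal sketch, assuming a finite set $A$ and a small range of $k$ (the closed form multiplies a tiny bracket by a large factorial factor, so floating point accuracy degrades for large $k$). The function names `poisson_pmf`, `stein_solution`, and `max_residual` are illustrative and do not come from the source.

```python
import math

def poisson_pmf(lam, k):
    """P(Z = k) for Z ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def stein_solution(lam, A, kmax):
    """Return [f_A(0), ..., f_A(kmax)] from the closed form in Lemma 4.2.

    A is assumed to be a finite set of non-negative integers.
    """
    p_A = sum(poisson_pmf(lam, j) for j in A)
    f = [0.0]      # boundary condition f_A(0) = 0
    p_U = 0.0      # running value of P_lam(U_k), U_k = {0, ..., k-1}
    p_AU = 0.0     # running value of P_lam(A intersect U_k)
    for k in range(1, kmax + 1):
        p_U += poisson_pmf(lam, k - 1)
        if (k - 1) in A:
            p_AU += poisson_pmf(lam, k - 1)
        scale = lam**(-k) * math.exp(lam) * math.factorial(k - 1)
        f.append(scale * (p_AU - p_A * p_U))
    return f

def max_residual(lam, A, kmax):
    """Largest violation of the recursion (4.1) over k = 0, ..., kmax - 1."""
    f = stein_solution(lam, A, kmax)
    p_A = sum(poisson_pmf(lam, j) for j in A)
    return max(
        abs(lam * f[k + 1] - k * f[k] - ((1.0 if k in A else 0.0) - p_A))
        for k in range(kmax)
    )
```

Calling `max_residual(2.0, {0, 3, 4}, 12)` should return a value at the level of floating point rounding, which is the numerical counterpart of "verified by substitution".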
Proof of Lemma 4.1. Item 1 follows easily by direct calculation: if $Z \sim \mathrm{Po}(\lambda)$ and $f$ is bounded, then

$$\begin{aligned}
\lambda\, \mathbb{E}[f(Z+1)] &= e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k+1}}{k!} f(k+1) \\
&= e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k+1}}{(k+1)!} (k+1) f(k+1) \\
&= \mathbb{E}[Z f(Z)].
\end{aligned}$$

For Item 2, let $\mathbb{E}\mathcal{A}f(W) = 0$ for all bounded functions $f$. Lemma 4.4 below shows that $f_k := f_{\{k\}}$ is bounded, and then $\mathbb{E}\mathcal{A}f_k(W) = 0$ implies that $W$ has Poisson point probabilities. Alternatively, for $j \in \mathbb{N} \cup \{0\}$, we could take $f(k) = I[k = j]$ so that $\mathbb{E}\mathcal{A}f(W) = 0$ implies that

$$\lambda\, \mathbb{P}(W = j-1) = j\, \mathbb{P}(W = j),$$

which is defining since $W$ is a non-negative integer-valued random variable. A third proof can be obtained by taking $f(k) = e^{-uk}$, from which the Laplace transform of $W$ can be derived. □

We now derive useful properties of the solutions fA of (4.1). 

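Lemma 4.4. If $f_A$ solves (4.1), then

$$\|f_A\| \leq \min\{1, \lambda^{-1/2}\} \quad \text{and} \quad \|\Delta f_A\| \leq \frac{1 - e^{-\lambda}}{\lambda} \leq \min\{1, \lambda^{-1}\}, \tag{4.2}$$

where $\Delta f(k) := f(k+1) - f(k)$.

Proof. The proof of Lemma 4.4 follows from careful analysis. We prove the second assertion and refer to [11] for further details. Upon rewriting

$$f_A(k) = \lambda^{-k} (k-1)!\, e^{\lambda} \left[\mathbb{P}_\lambda(A \cap U_k)\,\mathbb{P}_\lambda(U_k^c) - \mathbb{P}_\lambda(A \cap U_k^c)\,\mathbb{P}_\lambda(U_k)\right],$$

some consideration leads us to observe that for $j \geq 1$, $f_j := f_{\{j\}}$ satisfies

• $f_j(k) \leq 0$ for $k \leq j$ and $f_j(k) \geq 0$ for $k > j$,
• $\Delta f_j(k) \leq 0$ for $k \neq j$, and $\Delta f_j(j) \geq 0$,
• $\Delta f_j(j) \leq \min\{j^{-1}, (1 - e^{-\lambda})/\lambda\}$.

And also $\Delta f_0(k) < 0$. Since

$$\Delta f_A(k) = \sum_{j \in A} \Delta f_j(k)$$

is a sum of terms which are all negative except for at most one, we find

$$\Delta f_A(k) \leq \frac{1 - e^{-\lambda}}{\lambda}. \tag{4.3}$$

Since $f_{A^c} = -f_A$, (4.3) yields the second assertion. □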



We can now state our main Poisson approximation theorem which follows from Corollary 
4.3 and Lemma 4.4. 

Theorem 4.5. Let $\mathcal{F}$ be the set of functions satisfying (4.2). If $W$ is an integer-valued random variable with mean $\lambda$ and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \sup_{f \in \mathcal{F}} |\mathbb{E}[\lambda f(W+1) - W f(W)]|. \tag{4.4}$$

We are ready to apply Theorem 4.5 to some examples, but first some remarks. Recall that our main strategy for normal approximation was to find some structure in $W$, the random variable of interest, that allows us to compare $\mathbb{E}[W f(W)]$ to $\mathbb{E}[f'(W)]$ for appropriate $f$. The canonical such structures were

1. Sums of independent random variables,

2. Sums of locally dependent random variables,

3. Exchangeable pairs,

4. Size-biasing,

5. Zero-biasing.

Note that each of these structures essentially provided a way to break down $\mathbb{E}[W f(W)]$ into a functional of $f$ and some auxiliary random variables. Also, from the form of the Poisson characterizing operator, we want to find some structure in $W$ (the random variable of interest) that allows us to compare $\mathbb{E}[W f(W)]$ to $\lambda\, \mathbb{E}[f(W+1)]$ for appropriate $f$. These two observations imply that the first four items on the list above may be germane to Poisson approximation, which is exactly the program we will pursue (since zero-biasing involves $f'$, we won't find use for it in our discrete setting).
4.1 Law of small numbers

It is well known that if $W_n \sim \mathrm{Bi}(n, \lambda/n)$ and $Z \sim \mathrm{Po}(\lambda)$ then $d_{\mathrm{TV}}(W_n, Z) \to 0$ as $n \to \infty$, and it is not difficult to obtain a rate of this convergence. From this fact, it is easy to believe that if $X_1, \ldots, X_n$ are independent indicators with $\mathbb{P}(X_i = 1) = p_i$, then $W = \sum_{i=1}^n X_i$ will be approximately Poisson if $\max_i p_i$ is small. In fact, we will show the following result.

Theorem 4.6. Let $X_1, \ldots, X_n$ be independent indicators with $\mathbb{P}(X_i = 1) = p_i$, $W = \sum_{i=1}^n X_i$, and $\lambda = \mathbb{E}[W] = \sum_i p_i$. If $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i^2 \leq \min\{1, \lambda\} \max_i p_i.$$

Proof. The second inequality is clear and is only included to address the discussion preceding the theorem. For the first inequality, we apply Theorem 4.5. Let $f$ satisfy (4.2) and note that

$$\begin{aligned}
\mathbb{E}[W f(W)] &= \sum_{i=1}^n \mathbb{E}[X_i f(W)] \\
&= \sum_{i=1}^n \mathbb{E}[f(W) \mid X_i = 1]\, \mathbb{P}[X_i = 1] \\
&= \sum_{i=1}^n p_i\, \mathbb{E}[f(W_i + 1)],
\end{aligned} \tag{4.5}$$

where $W_i = W - X_i$ and (4.5) follows since $X_i$ is independent of $W_i$. Since $\lambda f(W+1) = \sum_i p_i f(W+1)$, we obtain

$$\begin{aligned}
|\mathbb{E}[\lambda f(W+1) - W f(W)]| &= \left| \sum_{i=1}^n p_i\, \mathbb{E}[f(W+1) - f(W_i+1)] \right| \\
&\leq \sum_{i=1}^n p_i \|\Delta f\|\, \mathbb{E}|W - W_i| \\
&\leq \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}[X_i],
\end{aligned}$$

where the first inequality is by rewriting $f(W+1) - f(W_i+1)$ as a telescoping sum of $|W - W_i|$ first differences of $f$, and the second uses (4.2). Combining this last calculation with Theorem 4.5 yields the desired result. □
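For small $n$ the bound of Theorem 4.6 can be compared against the exact total variation distance. The sketch below is illustrative only: it builds the distribution of $W$ by convolving the indicators one at a time, and the function names and the example probabilities are assumptions, not part of the source.

```python
import math

def poisson_binomial_pmf(ps):
    """Exact pmf of W = X_1 + ... + X_n for independent indicators."""
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)      # X_i = 0
            new[k + 1] += mass * p        # X_i = 1
        pmf = new
    return pmf

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def tv_to_poisson(pmf, lam):
    """Total variation distance between a pmf on {0, ..., n} and Po(lam)."""
    n = len(pmf) - 1
    inside = sum(abs(pmf[k] - poisson_pmf(lam, k)) for k in range(n + 1))
    # Beyond n the pmf of W vanishes, so the Poisson tail contributes in full.
    tail = 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1))
    return 0.5 * (inside + max(tail, 0.0))

def law_of_small_numbers_bound(ps):
    """First bound of Theorem 4.6: min{1, 1/lambda} * sum of p_i^2."""
    lam = sum(ps)
    return min(1.0, 1.0 / lam) * sum(p * p for p in ps)
```

For example, with `ps = [0.1, 0.05, 0.2, 0.02, 0.15]` the value of `tv_to_poisson(poisson_binomial_pmf(ps), sum(ps))` should not exceed `law_of_small_numbers_bound(ps)`.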
4.2 Dependency neighborhoods

Analogous to normal approximation, we can generalize Theorem 4.6 to sums of locally dependent variables [4, 5].

Theorem 4.7. Let $X_1, \ldots, X_n$ be indicator variables with $\mathbb{P}(X_i = 1) = p_i$, $W = \sum_{i=1}^n X_i$, and $\lambda = \mathbb{E}[W] = \sum_i p_i$. For each $i$, let $N_i \subseteq \{1, \ldots, n\}$ be such that $X_i$ is independent of $\{X_j : j \notin N_i\}$. If $p_{ij} := \mathbb{E}[X_i X_j]$ and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda^{-1}\} \left( \sum_{i=1}^n \sum_{j \in N_i} p_i p_j + \sum_{i=1}^n \sum_{j \in N_i \setminus \{i\}} p_{ij} \right).$$

Remark 4.1. The neighborhoods $N_i$ can be defined with greater flexibility (i.e. dropping the assumption that $X_i$ is independent of the variables not indexed by $N_i$) at the cost of an additional error term that (roughly) measures dependence (see [4, 5]).

Proof. We want to mimic the proof of Theorem 4.6 up to (4.5), the point where the hypothesis of independence is used. Let $f$ satisfy (4.2), $W_i = W - X_i$, and $V_i = \sum_{j \notin N_i} X_j$. Since $X_i f(W) = X_i f(W_i + 1)$ almost surely, we find

$$\begin{aligned}
\mathbb{E}[\lambda f(W+1) - W f(W)] &= \sum_{i=1}^n p_i\, \mathbb{E}[f(W+1) - f(W_i+1)] && (4.6) \\
&\quad + \sum_{i=1}^n \mathbb{E}[(p_i - X_i) f(W_i + 1)]. && (4.7)
\end{aligned}$$

As in the proof of Theorem 4.6, the absolute value of (4.6) is bounded above by $\|\Delta f\| \sum_i p_i^2$. Due to the independence of $X_i$ and $V_i$, and the fact that $\mathbb{E}[X_i] = p_i$, we find that (4.7) is equal to

$$\sum_{i=1}^n \mathbb{E}[(p_i - X_i)(f(W_i + 1) - f(V_i + 1))],$$

so that the absolute value of (4.7) is bounded above by

$$\begin{aligned}
\|\Delta f\| \sum_{i=1}^n \mathbb{E}\big[|p_i - X_i|\,|W_i - V_i|\big] &\leq \|\Delta f\| \sum_{i=1}^n \mathbb{E}\Big[(p_i + X_i) \sum_{j \in N_i \setminus \{i\}} X_j\Big] \\
&= \|\Delta f\| \sum_{i=1}^n \sum_{j \in N_i \setminus \{i\}} (p_i p_j + p_{ij}).
\end{aligned}$$

Combining these bounds for (4.6) and (4.7) yields the theorem. □
4.2.1 Application: Head runs

In this section we consider an example that arises in an application from biology, that of DNA comparison. We postpone discussion of the details of this relation until the end of the section.

In a sequence of zeroes and ones we call an occurrence of the pattern $\cdots 011 \cdots 10 \cdots$ (or $11 \cdots 10 \cdots$ or $\cdots 011 \cdots 1$ at the boundaries of the sequence) with exactly $k$ ones a head run of length $k$. Let $W$ be the number of head runs of length at least $k$ in a sequence of $n$ independent tosses of a coin with head probability $p$. More precisely, let $Y_1, \ldots, Y_n$ be i.i.d. indicator variables with $\mathbb{P}(Y_i = 1) = p$ and let

$$X_1 = \prod_{j=1}^{k} Y_j,$$

and for $i = 2, \ldots, n - k + 1$ let

$$X_i = (1 - Y_{i-1}) \prod_{j=0}^{k-1} Y_{i+j}.$$

Then $X_i$ is the indicator that a run of ones of length at least $k$ begins at position $i$ in the sequence $(Y_1, \ldots, Y_n)$ so that we set $W = \sum_{i=1}^{n-k+1} X_i$. Note that the factor $1 - Y_{i-1}$ is used to "de-clump" the runs of length greater than $k$ so that we do not count the same run more than once. At this point we can apply Theorem 4.7 with only a little effort to obtain the following result.

Theorem 4.8. Let $W$ be the number of head runs of at least length $k$ in a sequence of $n$ independent tosses of a coin with head probability $p$ as defined above. If $\lambda = \mathbb{E}[W] = p^k((n-k)(1-p) + 1)$ and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \lambda^2\, \frac{2k+1}{n-k+1} + 2\lambda p^k. \tag{4.8}$$

Remark 4.2. Although Theorem 4.8 provides an error for all $n, p, k$, it can also be interpreted asymptotically as $n \to \infty$ and $\lambda$ bounded away from zero and infinity. Roughly, if

$$k = \frac{\log(n(1-p))}{\log(1/p)} + c$$

for some constant $c$, then for fixed $p$, $\lim_{n \to \infty} \lambda = p^c$. In this case the bound (4.8) is of order $\log(n)/n$.
Proof of Theorem 4.8. As discussed in the remarks preceding the theorem, $W$ has representation as a sum of indicators: $W = \sum_{i=1}^{n-k+1} X_i$. The fact that $\lambda$ is as stated follows from this representation using that $\mathbb{E}[X_1] = p^k$ and $\mathbb{E}[X_i] = (1-p)p^k$ for $i \neq 1$.

We will apply Theorem 4.7 with $N_i = \{1 \leq j \leq n - k + 1 : |i - j| \leq k\}$ which clearly has the property that $X_i$ is independent of $\{X_j : j \notin N_i\}$. Moreover, if $j \in N_i \setminus \{i\}$, then $\mathbb{E}[X_i X_j] = 0$ since two runs of length at least $k$ cannot begin within $k$ positions of each other. Theorem 4.7 now implies

$$d_{\mathrm{TV}}(W, Z) \leq \sum_{i=1}^{n-k+1} \sum_{j \in N_i} \mathbb{E}[X_i]\, \mathbb{E}[X_j].$$

It only remains to show that this quantity is bounded above by (4.8) which follows by grouping and counting the terms of the sum into those that contain $\mathbb{E}[X_1]$ and those that do not. □
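The head run construction and the bound (4.8) can be checked by brute force for short sequences. The following sketch enumerates all $2^n$ coin sequences, which is only feasible for small $n$. The function names and the parameter choice $n = 12$, $k = 3$, $p = 0.3$ are illustrative assumptions.

```python
import math
from itertools import product

def count_head_runs(ys, k):
    """W: number of positions where a run of at least k ones begins."""
    n = len(ys)
    count = 0
    for i in range(n - k + 1):                         # 0-based start position
        preceded_by_tail = (i == 0) or (ys[i - 1] == 0)  # the de-clumping factor
        if preceded_by_tail and all(ys[i:i + k]):
            count += 1
    return count

def head_run_pmf(n, k, p):
    """Exact pmf of W by enumerating every 0/1 sequence of length n."""
    pmf = [0.0] * (n + 1)
    for ys in product((0, 1), repeat=n):
        heads = sum(ys)
        pmf[count_head_runs(ys, k)] += p**heads * (1 - p)**(n - heads)
    return pmf

def head_run_mean(n, k, p):
    """lambda = p^k ((n - k)(1 - p) + 1), as in Theorem 4.8."""
    return p**k * ((n - k) * (1 - p) + 1)

def head_run_bound(n, k, p):
    """Right-hand side of (4.8)."""
    lam = head_run_mean(n, k, p)
    return lam**2 * (2 * k + 1) / (n - k + 1) + 2 * lam * p**k

def tv_to_poisson(pmf, lam):
    """Total variation distance between a pmf on {0, ..., n} and Po(lam)."""
    po = [math.exp(-lam) * lam**j / math.factorial(j) for j in range(len(pmf))]
    inside = sum(abs(a - b) for a, b in zip(pmf, po))
    return 0.5 * (inside + max(1.0 - sum(po), 0.0))
```

With these parameters the enumerated mean of $W$ should agree with `head_run_mean`, and the exact distance `tv_to_poisson(head_run_pmf(12, 3, 0.3), head_run_mean(12, 3, 0.3))` should lie below `head_run_bound(12, 3, 0.3)`.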

A related quantity which is of interest in the biological application below is the length $R_n$ of the longest head run in $n$ independent coin tosses. Due to the equality of events, we have $\mathbb{P}(W = 0) = \mathbb{P}(R_n < k)$, so that we can use Remark 4.2 to roughly state

$$\left| \mathbb{P}\left( R_n - \frac{\log(n(1-p))}{\log(1/p)} < x \right) - e^{-p^x} \right| \leq C\, \frac{\log(n)}{n}.$$

The inequality above needs some qualification due to the fact that $R_n$ is integer-valued, but it can be made precise - see [4, 5, 6] for more details.

Theorem 4.8 was relatively simple to derive, and many embellishments are possible which can be handled similarly, but with more technicalities. For example, for $0 < a \leq 1$, we can define a "quality $a$" run of length $j$ to be a run of length $j$ with at least $aj$ heads. We could then take $W$ to be the number of quality $a$ runs of length at least $k$ and $R_n$ to be the longest quality $a$ run in $n$ independent coin tosses. A story analogous to that above emerges.

These particular results can also be viewed as elaborations of the classical theorem: 

Theorem 4.9 (Erdős-Rényi Law). If $R_n$ is the longest quality $a$ head run in a sequence of $n$ independent tosses of a coin with head probability $p$ as defined above, then almost surely,

$$\frac{R_n}{\log(n)} \to \frac{1}{H(a,p)},$$

where for $0 < a < 1$, $H(a,p) = a \log(a/p) + (1-a) \log((1-a)/(1-p))$, and $H(1,p) = \log(1/p)$.

Remark 4.3. Some of the impetus for the results above and especially their embellishments 
stems from an application in computational biology - see [4, 5, 6, 53] for an entry into this 
literature. We briefly describe this application here. 
DNA is made up of long sequences of the letters A, G, C, and T which stand for certain nucleotides. Frequently it is desirable to know how closely^ two sequences of DNA are related.

Assume for simplicity that the two sequences of DNA to be compared both have length $n$. One possible measure of closeness between these sequences is the length of the longest run where the sequences agree when compared coordinate-wise. More precisely, if sequence $A$ is $A_1 A_2 \cdots A_n$, sequence $B$ is $B_1 B_2 \cdots B_n$, and we define $Y_i = I[A_i = B_i]$, then the measure of closeness between the sequences $A$ and $B$ would be the length of the longest run of ones in $(Y_1, \ldots, Y_n)$.

Now, given the sequences $A$ and $B$, how long should the longest run be in order to consider them close? The usual statistical setup to handle this question is to assume a probabilistic model under the hypothesis that the sequences are not related, and then compute the probability of the event "at least as long a run" as the observed run. If this probability is low enough, then it is likely that the sequences are closely related (assuming the model is accurate).

We make the (likely unrealistic) assumption that sequences of DNA are generated as independent picks from the alphabet $\{A, G, C, T\}$ under some probability distribution with frequencies $p_A$, $p_G$, $p_C$, and $p_T$. The hypothesis that the sequences are unrelated corresponds to the sequences being generated independently.

In this framework, the distribution of the longest run between two unrelated sequences of DNA of length $n$ is exactly $R_n$ above with $p := \mathbb{P}(Y_i = 1) = p_A^2 + p_G^2 + p_C^2 + p_T^2$. Thus the work above can be used to approximate tail probabilities of the longest run length under the assumption that two sequences of DNA are unrelated and then used to determine the likeliness of the observed longest run lengths.
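As a rough illustration of this use, the sketch below approximates $\mathbb{P}(R_n \geq k) = \mathbb{P}(W \geq 1)$ by $1 - e^{-\lambda}$ with $\lambda$ as in Theorem 4.8. It ignores the approximation error (4.8), and the function names and the letter frequencies in the usage note are hypothetical.

```python
import math

def match_probability(freqs):
    """p = P(A_i = B_i) for two independent letters with common frequencies."""
    return sum(q * q for q in freqs.values())

def longest_run_tail(n, k, freqs):
    """Poisson approximation 1 - exp(-lambda) to P(R_n >= k) = P(W >= 1)."""
    p = match_probability(freqs)
    lam = p**k * ((n - k) * (1 - p) + 1)   # lambda from Theorem 4.8
    return 1.0 - math.exp(-lam)
```

For instance, `longest_run_tail(1000, 12, {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25})` gives the approximate chance that two unrelated uniform sequences of length 1000 share a matching run of length at least 12 under this model.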

4.3 Size-bias Coupling 

The most powerful method of rewriting $\mathbb{E}[W f(W)]$ so that it can be usefully compared to $\mathbb{E}[W]\,\mathbb{E}[f(W+1)]$ is through the size-bias coupling already defined in Section 3.4 - recall the relevant definitions and properties there. The book [11] is almost entirely devoted to Poisson approximation through the size-bias coupling (although that terminology is not used), so we will spend some time fleshing out their powerful and general results.

Theorem 4.10. Let $W \geq 0$ be an integer-valued random variable with $\mathbb{E}[W] = \lambda > 0$ and let $W^s$ be a size-bias coupling of $W$. If $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda\}\, \mathbb{E}|W + 1 - W^s|.$$

^For example, whether the two sequences have a similar biological function or whether one sequence could be transformed to the other by few mutations.

Proof. Let $f$ be bounded and $\|\Delta f\| \leq \min\{1, \lambda^{-1}\}$. Then

$$\begin{aligned}
|\mathbb{E}[\lambda f(W+1) - W f(W)]| &= \lambda\, |\mathbb{E}[f(W+1) - f(W^s)]| \\
&\leq \lambda \|\Delta f\|\, \mathbb{E}|W + 1 - W^s|,
\end{aligned}$$

where we have used the definition of the size-bias distribution and rewritten $f(W+1) - f(W^s)$ as a telescoping sum of $|W + 1 - W^s|$ terms. □

Due to the canonical "law of small numbers" for Poisson approximation, we will mostly 
be concerned with approximating a sum of indicators by a Poisson distribution. Recall the 
following construction of a size-bias coupling from Section 3.4, and useful special case. 

Corollary 4.11. Let $X_1, \ldots, X_n$ be indicator variables with $\mathbb{P}(X_i = 1) = p_i$, $W = \sum_{i=1}^n X_i$, and $\lambda = \mathbb{E}[W] = \sum_i p_i$. If for each $i = 1, \ldots, n$, $(X_j^{(i)})_{j \neq i}$ has the distribution of $(X_j)_{j \neq i}$ conditional on $X_i = 1$ and $I$ is a random variable independent of all else such that $\mathbb{P}(I = i) = p_i/\lambda$, then $W^s = \sum_{j \neq I} X_j^{(I)} + 1$ has the size-bias distribution of $W$.

Corollary 4.12. Let $X_1, \ldots, X_n$ be exchangeable indicator variables and let $(X_j^{(1)})_{j \neq 1}$ have the distribution of $(X_j)_{j \neq 1}$ conditional on $X_1 = 1$. If $W = \sum_{i=1}^n X_i$, then $W^s = \sum_{j \neq 1} X_j^{(1)} + 1$ has the size-bias distribution of $W$.

Proof. Corollary 4.11 was proved in Section 3.4 and Corollary 4.12 follows from the fact that exchangeability implies that $I$ is uniform and $\sum_{j \neq i} X_j^{(i)} + 1 \stackrel{d}{=} \sum_{j \neq 1} X_j^{(1)} + 1$. □

Example 4.4 (Law of small numbers). Let $W = \sum_{i=1}^n X_i$ where the $X_i$ are independent indicators with $\mathbb{P}(X_i = 1) = p_i$. According to Corollary 4.11, in order to size-bias $W$, we first choose an index $I$ with $\mathbb{P}(I = i) = p_i/\lambda$, where $\lambda = \mathbb{E}[W] = \sum_i p_i$. Given $I = i$ we construct $X_j^{(i)}$ having the distribution of $X_j$ conditional on $X_i = 1$. However, by independence, $(X_j^{(i)})_{j \neq i}$ has the same distribution as $(X_j)_{j \neq i}$ so that we can take $W^s = \sum_{j \neq I} X_j + 1$. Applying Theorem 4.10 we find that for $Z \sim \mathrm{Po}(\lambda)$,

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda\}\, \mathbb{E}[X_I] = \min\{1, \lambda\} \sum_{i=1}^n \frac{p_i}{\lambda}\, \mathbb{E}[X_i] = \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i^2,$$

which agrees with our previous bound for this example.
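The coupling in Example 4.4 can be checked against the definition of the size-bias distribution, $\mathbb{P}(W^s = k) = k\,\mathbb{P}(W = k)/\lambda$, by enumerating all outcomes for a handful of indicators. This is a minimal sketch; the helper names and the example probabilities are assumptions made for illustration.

```python
from itertools import product

def outcome_probability(xs, ps):
    """Probability of the 0/1 vector xs under independent indicators."""
    prob = 1.0
    for x, p in zip(xs, ps):
        prob *= p if x else 1.0 - p
    return prob

def pmf_of_sum(ps):
    """Exact pmf of W = X_1 + ... + X_n."""
    pmf = [0.0] * (len(ps) + 1)
    for xs in product((0, 1), repeat=len(ps)):
        pmf[sum(xs)] += outcome_probability(xs, ps)
    return pmf

def size_bias_by_definition(ps):
    """P(W^s = k) = k P(W = k) / lambda."""
    lam = sum(ps)
    return [k * mass / lam for k, mass in enumerate(pmf_of_sum(ps))]

def size_bias_by_coupling(ps):
    """Law of W^s = sum_{j != I} X_j + 1 with P(I = i) = p_i / lambda."""
    lam = sum(ps)
    pmf = [0.0] * (len(ps) + 1)
    for i, p_i in enumerate(ps):
        for xs in product((0, 1), repeat=len(ps)):
            w_s = sum(xs) - xs[i] + 1      # drop X_i, then add the forced one
            pmf[w_s] += (p_i / lam) * outcome_probability(xs, ps)
    return pmf
```

The two returned lists should agree entry by entry, for example with `ps = [0.2, 0.5, 0.1, 0.3]`.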

Example 4.5 (Isolated Vertices). Let $W$ be the number of isolated vertices in an Erdős-Rényi random graph on $n$ vertices with edge probabilities $p$. Note that $W = \sum_{i=1}^n X_i$, where $X_i$ is the indicator that vertex $v_i$ (in some arbitrary but fixed labeling) is isolated. We constructed a size-bias coupling of $W$ in Section 3.4 using Corollary 4.11, and we can simplify this coupling by using Corollary 4.12^ as follows.

^This simplification would not have yielded a useful error bound in Section 3.4 since the size-bias normal approximation theorem contains a variance term; there the randomization provides an extra factor of $1/n$.

We first generate an Erdős-Rényi random graph $G$, and then erase all edges connected to vertex $v_1$. Then take $X_j^{(1)}$ to be the indicator that vertex $v_j$ is isolated in this new graph. By the independence of the edges in the graph, it is clear that $(X_j^{(1)})_{j \neq 1}$ has the distribution of $(X_j)_{j \neq 1}$ conditional on $X_1 = 1$, so that by Corollary 4.12, we can take $W^s = \sum_{j \neq 1} X_j^{(1)} + 1$ and of course we take $W$ to be the number of isolated vertices in $G$.

In order to apply Theorem 4.10, we only need to compute $\lambda = \mathbb{E}[W]$ and $\mathbb{E}|W + 1 - W^s|$. From Example 3.9 in Section 3.4, $\lambda = n(1-p)^{n-1}$ and from the construction above

$$\begin{aligned}
\mathbb{E}|W + 1 - W^s| &= \mathbb{E}\left| X_1 + \sum_{j=2}^n \big(X_j - X_j^{(1)}\big) \right| \\
&\leq \mathbb{E}[X_1] + \sum_{j=2}^n \mathbb{E}\big[X_j^{(1)} - X_j\big],
\end{aligned}$$

where we use the fact that $X_j^{(1)} \geq X_j$ which follows since we can only increase the number of isolated vertices by erasing edges. Thus, $X_j^{(1)} - X_j$ is equal to zero or one and the latter happens only if vertex $v_j$ has degree one and is connected to $v_1$ which occurs with probability $p(1-p)^{n-2}$. Putting this all together in Theorem 4.10, we obtain the following.

Proposition 4.13. Let $W$ be the number of isolated vertices in an Erdős-Rényi random graph and $\lambda = \mathbb{E}[W]$. If $Z \sim \mathrm{Po}(\lambda)$, then

$$\begin{aligned}
d_{\mathrm{TV}}(W, Z) &\leq \min\{1, \lambda\} \left( (n-1)p(1-p)^{n-2} + (1-p)^{n-1} \right) \\
&\leq \min\{\lambda, \lambda^2\} \left( \frac{p}{1-p} + \frac{1}{n} \right).
\end{aligned}$$

To interpret this result asymptotically, if $\lambda$ is to stay away from zero and infinity as $n$ gets large, $p$ must be of order $\log(n)/n$, in which case the error above is of order $\log(n)/n$.
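For very small graphs, Proposition 4.13 can be checked exactly by enumerating every graph on $n$ labeled vertices. The sketch below does this for the first bound of the proposition. Enumeration costs $2^{n(n-1)/2}$ steps, so it is only meant for $n$ around 5 or 6, and all names are illustrative.

```python
import math
from itertools import combinations, product

def isolated_vertex_pmf(n, p):
    """Exact pmf of the number of isolated vertices in G(n, p)."""
    edges = list(combinations(range(n), 2))
    pmf = [0.0] * (n + 1)
    for present in product((0, 1), repeat=len(edges)):
        m = sum(present)
        prob = p**m * (1 - p)**(len(edges) - m)
        touched = set()                    # vertices with at least one edge
        for (u, v), on in zip(edges, present):
            if on:
                touched.update((u, v))
        pmf[n - len(touched)] += prob
    return pmf

def isolated_vertex_bound(n, p):
    """First bound of Proposition 4.13."""
    lam = n * (1 - p)**(n - 1)
    return min(1.0, lam) * ((n - 1) * p * (1 - p)**(n - 2) + (1 - p)**(n - 1))

def tv_to_poisson(pmf, lam):
    """Total variation distance between a pmf on {0, ..., n} and Po(lam)."""
    po = [math.exp(-lam) * lam**j / math.factorial(j) for j in range(len(pmf))]
    inside = sum(abs(a - b) for a, b in zip(pmf, po))
    return 0.5 * (inside + max(1.0 - sum(po), 0.0))
```

With `n = 5` and `p = 0.5`, the mean of the enumerated distribution should equal $n(1-p)^{n-1} = 0.3125$ and the exact distance should lie below `isolated_vertex_bound(5, 0.5)`.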

Example 4.6 (Degree d vertices). We can generalize Example 4.5 by taking $W$ to be the number of degree $d \geq 0$ vertices in an Erdős-Rényi random graph on $n$ vertices with edge probabilities $p$. Note that $W = \sum_{i=1}^n X_i$, where $X_i$ is the indicator that vertex $v_i$ (in some arbitrary but fixed labeling) has degree $d$. We can construct a size-bias coupling of $W$ by using Corollary 4.12 as follows. Let $G$ be an Erdős-Rényi random graph.

• If the degree of vertex $v_1$ is $d_1 \geq d$, then erase $d_1 - d$ edges chosen uniformly at random from the $d_1$ edges connected to $v_1$.

• If the degree of vertex $v_1$ is $d_1 < d$, then add edges from $v_1$ to $d - d_1$ vertices chosen uniformly at random from the $n - d_1 - 1$ vertices unconnected to $v_1$.

Let $X_j^{(1)}$ be the indicator that vertex $v_j$ has degree $d$ in this new graph. By the independence of the edges in the graph, it is clear that $(X_j^{(1)})_{j \neq 1}$ has the distribution of $(X_j)_{j \neq 1}$ conditional on $X_1 = 1$, so that by Corollary 4.12, we can take $W^s = \sum_{j \neq 1} X_j^{(1)} + 1$ and of course we take $W$ to be the number of degree $d$ vertices in $G$.

Armed with this coupling, we could apply Theorem 4.10 to yield a bound in the variation 
distance between W and a Poisson distribution. However, the analysis for this particular 
example is a bit technical, so we refer to Section 5.2 of [11] for the details. 



4.3.1 Increasing size-bias couplings 

A crucial simplification occurred in Example 4.5 because the size-bias coupling was increasing 
in a certain sense. The following result quantifies this simplification. 

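Theorem 4.14. Let $X_1, \ldots, X_n$ be indicator variables with $\mathbb{P}(X_i = 1) = p_i$, $W = \sum_{i=1}^n X_i$, and $\lambda = \mathbb{E}[W] = \sum_i p_i$. For each $i = 1, \ldots, n$, let $(X_j^{(i)})_{j \neq i}$ have the distribution of $(X_j)_{j \neq i}$ conditional on $X_i = 1$ and let $I$ be a random variable independent of all else, such that $\mathbb{P}(I = i) = p_i/\lambda$ so that $W^s = \sum_{j \neq I} X_j^{(I)} + 1$ has the size-bias distribution of $W$. If $X_j^{(i)} \geq X_j$ for all $i \neq j$, and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda^{-1}\} \left( \mathrm{Var}(W) - \lambda + 2 \sum_{i=1}^n p_i^2 \right).$$

Proof. Let $W_i = \sum_{j \neq i} X_j^{(i)} + 1$. From Theorem 4.10 and the size-bias construction of Corollary 4.11, we have

$$\begin{aligned}
d_{\mathrm{TV}}(W, Z) &\leq \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}|W + 1 - W_i| \\
&\leq \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}\left[ \sum_{j \neq i} \big(X_j^{(i)} - X_j\big) + X_i \right] \\
&= \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}[W_i - W - 1 + 2X_i],
\end{aligned} \tag{4.9}$$

where the second inequality uses the monotonicity of the size-bias coupling. Using again the construction of the size-bias coupling we obtain that (4.9) is equal to

$$\min\{1, \lambda^{-1}\} \left( \lambda\, \mathbb{E}[W^s] - \lambda^2 - \lambda + 2 \sum_{i=1}^n p_i^2 \right),$$

which yields the desired inequality by the definition of the size-bias distribution. □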
4.3.2 Application: Subgraph counts

Let $G = G(n,p)$ be an Erdős-Rényi random graph on $n$ vertices with edge probability $p$ and let $H$ be a graph on $0 < v_H \leq n$ vertices with $e_H$ edges and no isolated vertices. We want to analyze the number of copies of $H$ in $G$; that is, the number of subgraphs of the complete graph on $n$ vertices which are isomorphic to $H$ and which appear in $G$. For example, we could take $H$ to be a triangle so that $v_H = e_H = 3$.

Let $\Gamma$ be the set of all copies of $H$ in $K_n$, the complete graph on $n$ vertices, and for $\alpha \in \Gamma$, let $X_\alpha$ be the indicator that there is a copy of $H$ in $G$ at $\alpha$ and set $W = \sum_{\alpha \in \Gamma} X_\alpha$. We now have the following result.

Theorem 4.15. Let $W$ be the number of copies of a graph $H$ with no isolated vertices in $G$ as defined above and let $\lambda = \mathbb{E}[W]$. If $H$ has $e_H$ edges and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda^{-1}\} \left( \mathrm{Var}(W) - \lambda + 2\lambda p^{e_H} \right).$$

Proof. We will show that Theorem 4.14 applies to $W$. Since $W = \sum_{\alpha \in \Gamma} X_\alpha$ is a sum of exchangeable indicators, we can apply Corollary 4.12 to construct a size-bias coupling of $W$. To this end, for a fixed $\alpha \in \Gamma$, let $X_\beta^{(\alpha)}$ be the indicator that there is a copy of $H$ in $G \cup \{\alpha\}$ at $\beta$. Here, $G \cup \{\alpha\}$ means we add the minimum edges necessary to $G$ to have a copy of $H$ at $\alpha$. The following three evident facts now imply the theorem:

1. $(X_\beta^{(\alpha)})_{\beta \neq \alpha}$ has the distribution of $(X_\beta)_{\beta \neq \alpha}$ given that $X_\alpha = 1$.

2. For all $\beta \in \Gamma \setminus \{\alpha\}$, $X_\beta^{(\alpha)} \geq X_\beta$.

3. $\mathbb{E}[X_\alpha] = p^{e_H}$.

□

Theorem 4.15 is a very general result, but it can be difficult to interpret. That is, what properties of a subgraph $H$ make $W$ approximately Poisson? We can begin to answer that question by expressing the mean and variance of $W$ in terms of properties of $H$ which yields the following.

Corollary 4.16. Let $W$ be the number of copies of a graph $H$ with no isolated vertices in $G$ as defined above and let $\lambda = \mathbb{E}[W]$. For fixed $\alpha \in \Gamma$, let $\Gamma_\alpha^t \subseteq \Gamma$ be the set of subgraphs of $K_n$ isomorphic to $H$ with exactly $t$ edges not in $\alpha$. If $H$ has $e_H$ edges and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda\} \left( p^{e_H} + \sum_{t=1}^{e_H - 1} |\Gamma_\alpha^t| \left( p^t - p^{e_H} \right) \right).$$
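Proof. The corollary follows after deriving the mean and variance of $W$. The terms $|\Gamma_\alpha^t|$ account for the number of covariance terms for different types of pairs of indicators. In detail,

$$\begin{aligned}
\mathrm{Var}(W) &= \sum_{\alpha \in \Gamma} \mathrm{Var}(X_\alpha) + \sum_{\alpha \in \Gamma} \sum_{t=1}^{e_H - 1} \sum_{\beta \in \Gamma_\alpha^t} \mathrm{Cov}(X_\alpha, X_\beta) \\
&= \lambda (1 - p^{e_H}) + \sum_{\alpha \in \Gamma} p^{e_H} \sum_{t=1}^{e_H - 1} \sum_{\beta \in \Gamma_\alpha^t} \left( \mathbb{E}[X_\beta \mid X_\alpha = 1] - p^{e_H} \right) \\
&= \lambda \left( 1 - p^{e_H} + \sum_{t=1}^{e_H - 1} |\Gamma_\alpha^t| \left( p^t - p^{e_H} \right) \right),
\end{aligned}$$

since $\lambda = \sum_{\alpha \in \Gamma} p^{e_H}$ and for $\beta \in \Gamma_\alpha^t$, $\mathbb{E}[X_\beta \mid X_\alpha = 1] = p^t$. □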

It is possible to rewrite the error in other forms which can be used to make some general 
statements (see [11], Chapter 5), but we content ourselves with some examples. 

Example 4.7 (Triangles). Let $H$ be a triangle. In this case, $e_H = 3$, $|\Gamma_\alpha^2| = 3(n-3)$, and $|\Gamma_\alpha^1| = 0$ since triangles either share one edge or all three edges (corresponding to $t = 2$ and $t = 0$). Thus Corollary 4.16 implies that for $W$ the number of triangles in $G$ and $Z$ an appropriate Poisson variable,

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda\} \left( p^3 + 3(n-3)p^2(1-p) \right). \tag{4.10}$$

Since $\lambda = \binom{n}{3} p^3$ we can view (4.10) as an asymptotic result with $p$ of order $1/n$. In this case, (4.10) is of order $1/n$.
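The claimed $1/n$ rate can be seen by evaluating (4.10) along $p = c/n$. The sketch below assumes a fixed constant $c$ of our choosing; `triangle_bound` is an illustrative name.

```python
import math

def triangle_bound(n, p):
    """Right-hand side of (4.10) for triangle counts in G(n, p)."""
    lam = math.comb(n, 3) * p**3
    return min(1.0, lam) * (p**3 + 3 * (n - 3) * p**2 * (1 - p))

def scaled_triangle_bound(n, c):
    """n times the bound at p = c / n; roughly constant if the rate is 1/n."""
    return n * triangle_bound(n, c / n)
```

For fixed $c$, `scaled_triangle_bound(n, c)` should approach $3c^2 \min\{1, c^3/6\}$ as $n$ grows, since $\lambda \to c^3/6$ and the dominant term in the bracket is $3np^2$.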

Example 4.8 (k-cycles). More generally, we can let $H$ be a $k$-cycle (a triangle is a 3-cycle). Now note that for some constants $c_t$ and $C_k$,

$$|\Gamma_\alpha^t| \leq \binom{k}{k-t} c_t\, n^{t-1} \leq C_k\, n^{t-1},$$

since we choose the $k - t$ edges shared in the $k$-cycle $\alpha$, and then we have order $n^{t-1}$ sequences of vertices to create a cycle with $t$ edges outside of the $k - t$ edges shared with $\alpha$. The second inequality follows by maximizing $\binom{k}{k-t} c_t$ over the possible values of $t$. We can now find for $W$ the number of $k$-cycles in $G$ and $Z$ an appropriate Poisson variable,

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda\} \left( p^k + C_k\, p \sum_{t=1}^{k-1} (np)^{t-1} \right). \tag{4.11}$$

To interpret this bound asymptotically, we note that $\lambda = |\Gamma| p^k$ and

$$|\Gamma| = \binom{n}{k} \frac{k!}{2k},$$

since the number of non-isomorphic $k$-cycles on $K_k$ is $k!/(2k)$ (since $k!$ is the number of permutations of the vertices, which over counts by a factor of $2k$ due to reflections and rotations). Thus $\lambda$ is of order $(np)^k$ for fixed $k$ so that we take $p$ to be of order $1/n$ and in this regime (4.11) is of order $1/n$.

Similar results can be derived for induced and isolated subgraph counts - again we refer 
to [11], Chapter 5. 

4.3.3 Implicit coupling 

In this section we show that it can be possible to apply Theorem 4.14 without constructing 
the size-bias coupling explicitly. We first need some terminology. 

Definition 4.9. We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is increasing (decreasing) if for all $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ such that $x_i \leq y_i$ for all $i = 1, \ldots, n$, we have $f(x) \leq f(y)$ ($f(x) \geq f(y)$).

Theorem 4.17. Let $Y = (Y_j)_{j=1}^N$ be a finite collection of independent indicators and assume $X_1, \ldots, X_n$ are increasing or decreasing functions from $\{0,1\}^N$ into $\{0,1\}$. If $W = \sum_{i=1}^n X_i(Y)$, $p_i = \mathbb{E}[X_i(Y)]$, $\mathbb{E}[W] = \lambda$, and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda^{-1}\} \left( \mathrm{Var}(W) - \lambda + 2 \sum_{i=1}^n p_i^2 \right).$$

Proof. We will show that there exists a size-bias coupling of $W$ satisfying the hypotheses of Theorem 4.14, which implies the result. From Lemma 4.18 below, it is enough to show that $\mathrm{Cov}(X_i(Y), \varphi \circ X(Y)) \geq 0$ for all increasing indicator functions $\varphi$. However, since each $X_i(Y)$ is an increasing or decreasing function applied to independent indicators, then so is $\varphi \circ X(Y)$. Thus we may apply the FKG inequality (see Chapter 2 of [37]) which in this case states

$$\mathbb{E}[X_i(Y)\, \varphi \circ X(Y)] \geq \mathbb{E}[X_i(Y)]\, \mathbb{E}[\varphi \circ X(Y)],$$

as desired. □

Lemma 4.18. Let $X = (X_1, \ldots, X_n)$ be a vector of indicator variables and let $X^{(i)} = (X_1^{(i)}, \ldots, X_n^{(i)}) \stackrel{d}{=} X \mid X_i = 1$. Then the following are equivalent:

1. There exists a coupling such that $X_j^{(i)} \geq X_j$.

2. For all increasing indicator functions $\varphi$, $\mathbb{E}[\varphi(X^{(i)})] \geq \mathbb{E}[\varphi(X)]$.

3. For all increasing indicator functions $\varphi$, $\mathrm{Cov}(X_i, \varphi(X)) \geq 0$.

Proof. The equivalence $1 \Leftrightarrow 2$ follows from a general version of Strassen's theorem which can be found in [37]. In one dimension, Strassen's theorem says that there exists a coupling of random variables $X$ and $Y$ such that $X \geq Y$ if and only if $F_X(z) \leq F_Y(z)$ for all $z \in \mathbb{R}$ where $F_X$ and $F_Y$ are distribution functions.

The equivalence $2 \Leftrightarrow 3$ follows from the following calculation.

$$\mathbb{E}[\varphi(X^{(i)})] = \mathbb{E}[\varphi(X) \mid X_i = 1] = \mathbb{E}[X_i \varphi(X) \mid X_i = 1] = \frac{\mathbb{E}[X_i \varphi(X)]}{\mathbb{P}(X_i = 1)}.$$

□

Example 4.10 (Subgraph counts). Theorem 4.17 applies to the example of Section 4.3.2, since the indicator of a copy of $H$ at a given location is an increasing function of the edge indicators of the graph $G$.

Example 4.11 (Large degree vertices). Let $d \geq 0$ and let $W$ be the number of vertices with degree at least $d$. Clearly $W$ is a sum of indicators that are increasing functions of the edge indicators of the graph $G$ so that Theorem 4.17 can be applied. After some technical analysis, we arrive at the following result - see Section 5.2 of [11] for details. If

$q_1 = \sum_{k \geq d} \binom{n-1}{k} p^k (1-p)^{n-k-1}$ and $Z \sim \mathrm{Po}(n q_1)$, then



{n - l)pqi 

Example 4.12 (Small degree vertices). Let $d \geq 0$ and let $W$ be the number of vertices with degree at most $d$. Clearly $W$ is a sum of indicators that are decreasing functions of the edge indicators of the graph $G$ so that Theorem 4.17 can be applied. After some technical analysis, we arrive at the following result - see Section 5.2 of [11] for details. If

$q_2 = \sum_{k \leq d} \binom{n-1}{k} p^k (1-p)^{n-k-1}$ and $Z \sim \mathrm{Po}(n q_2)$, then



(n-l)(l-p)g2 
4.3.4 Decreasing size-bias couplings 

In this section we prove and apply a result complementary to Theorem 4.14. 
Theorem 4.19. Let $X_1, \ldots, X_n$ be indicator variables with $\mathbb{P}(X_i = 1) = p_i$, $W = \sum_{i=1}^n X_i$, and $\lambda = \mathbb{E}[W] = \sum_i p_i$. For each $i = 1, \ldots, n$, let $(X_j^{(i)})_{j \neq i}$ have the distribution of $(X_j)_{j \neq i}$ conditional on $X_i = 1$ and let $I$ be a random variable independent of all else, such that $\mathbb{P}(I = i) = p_i/\lambda$ so that $W^s = \sum_{j \neq I} X_j^{(I)} + 1$ has the size-bias distribution of $W$. If $X_j^{(i)} \leq X_j$ for all $i \neq j$, and $Z \sim \mathrm{Po}(\lambda)$, then

$$d_{\mathrm{TV}}(W, Z) \leq \min\{1, \lambda\} \left( 1 - \frac{\mathrm{Var}(W)}{\lambda} \right).$$

Proof. Let $W_i = \sum_{j \neq i} X_j^{(i)} + 1$. From Theorem 4.10 and the size-bias construction of Corollary 4.11, we have

$$\begin{aligned}
d_{\mathrm{TV}}(W, Z) &\leq \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}|W + 1 - W_i| \\
&= \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}\left[ \sum_{j \neq i} \big(X_j - X_j^{(i)}\big) + X_i \right] \\
&= \min\{1, \lambda^{-1}\} \sum_{i=1}^n p_i\, \mathbb{E}[W - W_i + 1],
\end{aligned} \tag{4.12}$$

where the penultimate equality uses the monotonicity of the size-bias coupling. Using again the construction of the size-bias coupling we obtain that (4.12) is equal to

$$\min\{1, \lambda^{-1}\} \left( \lambda^2 - \lambda\, \mathbb{E}[W^s] + \lambda \right),$$

which yields the desired inequality by the definition of the size-bias distribution. □

Example 4.13 (Hypergeometric distribution). Suppose we have $N$ balls in an urn in which $1 \leq n \leq N$ are colored red and we draw $1 \leq m \leq N$ balls uniformly at random without replacement so that each of the $\binom{N}{m}$ subsets of balls is equally likely. Let $W$ be the number of red balls. It is well known that if $N$ is large and $m/N$ is small, then $W$ has approximately a binomial distribution since the dependence diminishes. Thus, we would also expect $W$ to be approximately Poisson distributed if in addition, $n/N$ is small. We can use Theorem 4.19 to make this heuristic precise.

Label the red balls in the urn arbitrarily and let $X_i$ be the indicator that ball $i$ is chosen in the $m$-sample so that we have the representation $W = \sum_{i=1}^n X_i$. Since the $X_i$ are exchangeable, we can use Corollary 4.12 to size-bias $W$. If the ball labelled one already appears in the $m$-sample then do nothing, otherwise, we force $X_1^{(1)} = 1$ by adding the ball labelled one to the sample and putting back a ball chosen uniformly at random from the initial $m$-sample. If we let $X_i^{(1)}$ be the indicator that ball $i$ is in the sample after this procedure, then it is clear that $(X_i^{(1)})_{i \geq 2}$ has the distribution of $(X_i)_{i \geq 2}$ conditional on $X_1 = 1$ so that we can take $W^s = \sum_{i \geq 2} X_i^{(1)} + 1$.

In the construction of $W^s$ above, no additional red balls labeled $2, \ldots, n$ can be added to the $m$-sample. Thus $X_i^{(1)} \leq X_i$ for $i \geq 2$, and we can apply Theorem 4.19. A simple calculation yields

$$\mathbb{E}[W] = \frac{nm}{N} \quad \text{and} \quad \mathrm{Var}(W) = \frac{nm(N-n)(N-m)}{N^2(N-1)},$$

so that for $Z \sim \mathrm{Po}(nm/N)$,

$$d_{\mathrm{TV}}(W, Z) \leq \min\left\{1, \frac{nm}{N}\right\} \left( \frac{n}{N-1} + \frac{m}{N-1} - \frac{nm}{N(N-1)} - \frac{1}{N-1} \right).$$

Remark 4.14. This same bound can be recovered after noting the well known fact that 
W has the representation as a sum of independent indicators (see [42]) which implies that 
Theorem 4.19 can be applied. 
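The hypergeometric bound of Example 4.13 is easy to compare with the exact total variation distance, since the hypergeometric probabilities are available in closed form. The following sketch uses only the standard library; the function names and the parameters $N = 100$, $n = 10$, $m = 8$ are illustrative assumptions.

```python
import math

def hypergeom_pmf(N, n, m, k):
    """P(W = k): k red balls among m draws from N balls, n of them red."""
    if k < max(0, n + m - N) or k > min(n, m):
        return 0.0
    return math.comb(n, k) * math.comb(N - n, m - k) / math.comb(N, m)

def hypergeom_tv_to_poisson(N, n, m):
    """Exact total variation distance between W and Po(nm/N)."""
    lam = n * m / N
    top = min(n, m)
    po = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(top + 1)]
    inside = sum(abs(hypergeom_pmf(N, n, m, k) - po[k]) for k in range(top + 1))
    return 0.5 * (inside + max(1.0 - sum(po), 0.0))

def hypergeom_bound(N, n, m):
    """min{1, lambda} (1 - Var(W)/lambda), written out as in Example 4.13."""
    lam = n * m / N
    return min(1.0, lam) * ((n + m - 1) / (N - 1) - n * m / (N * (N - 1)))
```

For these parameters `hypergeom_tv_to_poisson(100, 10, 8)` should not exceed `hypergeom_bound(100, 10, 8)`, which evaluates to about 0.13.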

Example 4.15 (Coupon collecting). Assume that a certain brand of cereal puts a toy in 
each carton. There are n distinct types of toys and each week you pick up a carton of this 
cereal from the grocery store in such a way as to receive a uniformly random type of toy, 
independent of the toys received previously. The classical coupon collecting problem asks 
the question of how many cartons of cereal you must pick up in order to have received all n 
types of toys. 

We formulate the problem as follows. Assume you have n boxes and k balls are tossed 
independently into these boxes uniformly at random. Let W be the number of empty boxes 
after tossing all k balls into the boxes. Viewing the n boxes as types of toys and the k balls 
as cartons of cereal, it is easy to see that the event {W = 0} corresponds to the event that 
k cartons of cereal are sufficient to receive all n types of toys. We will use Theorem 4.19 
to show that W is approximately Poisson which will yield an estimate with error for the 
probability of this event. 

Let $X_i$ be the indicator that box $i$ (under some arbitrary labeling) is empty after tossing
the $k$ balls so that $W = \sum_{i=1}^n X_i$. Since the $X_i$ are exchangeable, we can use Corollary 4.12
to size-bias $W$ by first setting $X_1 = 1$ by emptying box 1 (if it is not already empty) and
then redistributing the balls in box 1 uniformly among boxes 2 through $n$. If we let $X_i^{(1)}$ be
the indicator that box $i$ is empty after this procedure, then it is clear that $(X_i^{(1)})_{i \ge 2}$ has the
distribution of $(X_i)_{i \ge 2}$ conditional on $X_1 = 1$ so that we can take $W^s = \sum_{i \ge 2} X_i^{(1)} + 1$.

In the construction of $W^s$ above we can only add balls to boxes 2 through $n$, which implies
$X_i^{(1)} \le X_i$ for $i \ge 2$ so that we can apply Theorem 4.19. In order to apply the theorem we
only need to compute the mean and variance of $W$. First note that $P(X_i = 1) = ((n-1)/n)^k$
so that

$$\lambda := E[W] = n\left(1 - \frac{1}{n}\right)^k,$$

and also that for $i \ne j$, $P(X_i = 1, X_j = 1) = ((n-2)/n)^k$ so that

$$\mathrm{Var}(W) = \lambda\left[1 - \left(1 - \frac{1}{n}\right)^k\right] + n(n-1)\left[\left(1 - \frac{2}{n}\right)^k - \left(1 - \frac{1}{n}\right)^{2k}\right].$$



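Using these calculations in Theorem 4.19 yields that for $Z \sim \mathrm{Po}(\lambda)$,

$$d_{TV}(W,Z) \le \min\{1,\lambda\}\left(\left(1 - \frac{1}{n}\right)^k + (n-1)\left[\left(1 - \frac{1}{n}\right)^k - \left(1 - \frac{1}{n-1}\right)^k\right]\right). \qquad (4.13)$$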



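In order to interpret this result asymptotically, let $k = n\log(n) - cn$ for some constant
$c$ so that $\lambda \to e^c$ as $n \to \infty$; in particular, $\lambda$ is bounded away from zero and infinity.
In this case (4.13) is asymptotically of order

$$\frac{\lambda}{n} + \lambda\left[1 - \left(\frac{n(n-2)}{(n-1)^2}\right)^k\right],$$

and since for $0 < a \le 1$, $1 - a^x \le -\log(a)x$ and also $\log(1+x) \le x$, we find

$$1 - \left(\frac{n(n-2)}{(n-1)^2}\right)^k \le k\log\left(\frac{(n-1)^2}{n(n-2)}\right) \le \frac{k}{n(n-2)} = \frac{\log(n) - c}{n-2},$$

which implies

$$d_{TV}(W,Z) \le C\,\frac{\log(n)}{n}.$$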
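
The quantities in Example 4.15 are easy to evaluate numerically. The following is a minimal sketch, not part of the original development: the function names and the choice $n = 20$, $k = 60$ are illustrative assumptions. It computes $\lambda$, $\mathrm{Var}(W)$ and the right hand side of (4.13), and compares the exact value of $P(W = 0)$ (obtained by inclusion-exclusion) with the Poisson value $e^{-\lambda}$.

```python
import math

def coupon_poisson_bound(n: int, k: int) -> dict:
    """Mean, variance and the bound (4.13) for the number W of empty boxes (n >= 2)."""
    q1 = (1 - 1 / n) ** k          # P(X_i = 1)
    q2 = (1 - 2 / n) ** k          # P(X_i = 1, X_j = 1) for i != j
    lam = n * q1
    var = lam * (1 - q1) + n * (n - 1) * (q2 - q1 ** 2)
    bound = min(1.0, lam) * (q1 + (n - 1) * (q1 - (1 - 1 / (n - 1)) ** k))
    return {"lam": lam, "var": var, "bound": bound}

def prob_all_boxes_filled(n: int, k: int) -> float:
    """Exact P(W = 0) by inclusion-exclusion over the set of empty boxes."""
    return sum((-1) ** j * math.comb(n, j) * (1 - j / n) ** k for j in range(n + 1))

# Illustrative parameters (an assumption, chosen only so that lambda is of order one).
stats = coupon_poisson_bound(20, 60)
exact = prob_all_boxes_filled(20, 60)
poisson = math.exp(-stats["lam"])
print(stats, exact, poisson)
```

Since the total variation distance dominates the difference of the probabilities of any single event, $|P(W = 0) - e^{-\lambda}|$ should not exceed the computed bound.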



Example 4.16 (Coupon collecting continued). We can embellish the coupon collecting 
problem of Example 4.15 in a number of ways; recall the notation and setup there. For 
example, rather than distribute the balls into the boxes uniformly and independently, we 
could distribute each ball independently according to some probability distribution, say $p_i$
is the chance that a ball goes into box $i$ with $\sum_{i=1}^n p_i = 1$. Note that Example 4.15 had
$p_i = 1/n$ for all $i$.

Let $X_i$ be the indicator that box $i$ (under some arbitrary labeling) is empty after tossing
$k$ balls and $W = \sum_{i=1}^n X_i$. In this setting, the $X_i$ are not necessarily exchangeable so that
we use Corollary 4.11 to construct $W^s$. First we compute

$$\lambda := E[W] = \sum_{i=1}^n (1 - p_i)^k,$$

and let $I$ be a random variable such that $P(I = i) = (1 - p_i)^k/\lambda$. Corollary 4.11 now states
that in order to construct $W^s$, we empty box $I$ (forcing $X_I = 1$) and then redistribute the
balls that were removed into the remaining boxes independently and with chance of landing
in box $j$ equal to $p_j/(1 - p_I)$. If we let $X_j^{(I)}$ be the indicator that box $j$ is empty after
this procedure, then it is clear that $(X_j^{(I)})_{j \ne I}$ has the distribution of $(X_j)_{j \ne I}$ conditional on
$X_I = 1$ so that we can take $W^s = \sum_{j \ne I} X_j^{(I)} + 1$.

Analogous to Example 4.15, $X_j^{(i)} \le X_j$ for $j \ne i$ so that we can apply Theorem 4.19.
The mean and variance are easily computed, but not as easily interpreted. A little analysis
(see [11] Section 6.2) yields that for $Z \sim \mathrm{Po}(\lambda)$,

$$d_{TV}(W,Z) \le \min\{1,\lambda\}\left[\max_i (1 - p_i)^k + \frac{k}{\lambda}\left(\frac{\lambda\log(k)}{k - \log(\lambda)} + \frac{4}{k}\right)^2\right].$$



Remark 4.17. We could also consider the number of boxes with at most $m \ge 0$ balls; we
just studied the case $m = 0$. The coupling is still decreasing because it can be constructed
by first choosing a box randomly and if it has greater than $m$ balls in it, redistributing a
random number of them among the other boxes. Thus Theorem 4.19 can be applied and an
error in the Poisson approximation can be obtained in terms of the mean and the variance.
We again refer to Chapter 6 of [11] for the details.

Finally, we could also consider the number of boxes containing exactly $m \ge 0$ balls and
containing at least $m \ge 1$ balls. Theorem 4.14 applies to the latter problem. See Chapter 6
of [11].



4.4 Exchangeable Pairs 

In this section, we will develop Stein's method for Poisson approximation using exchangeable 
pairs as detailed in [20]. The applications for this theory are not as developed as that of 
dependency neighborhoods and size-biasing, but the method fits well into our framework 
and the ideas here prove useful elsewhere [47]. 

As we have done for dependency neighborhoods and size-biasing, we could develop
exchangeable pairs for Poisson approximation by following our development for normal
approximation which involved rewriting $E[Wf(W)]$ as the expectation of a term involving the
exchangeable pair and $f$. However, this approach is not as useful as a different one which
has the added advantage of removing the $a$-Stein pair linearity condition.

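Theorem 4.20. Let $W$ be a non-negative integer valued random variable and let $(W, W')$
be an exchangeable pair. If $\mathcal{F}$ is a sigma-field with $\sigma(W) \subseteq \mathcal{F}$, and $Z \sim \mathrm{Po}(\lambda)$, then for all
$c \in \mathbb{R}$,

$$d_{TV}(W,Z) \le \min\{1, \lambda^{-1/2}\}\left(E\left|\lambda - cP(W' = W+1 \mid \mathcal{F})\right| + E\left|W - cP(W' = W-1 \mid \mathcal{F})\right|\right).$$
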
Before the proof, a few remarks. 






Remark 4.18. Typically $c$ is chosen to be approximately equal to $\lambda/P(W' = W+1) =
\lambda/P(W' = W-1)$ so that the terms in absolute value have a small mean.

Remark 4.19. Similar to exchangeable pairs for normal approximation, there is a stochastic
interpretation for the terms appearing in the error of Theorem 4.20. We can define a
birth-death process on $\mathbb{N} \cup \{0\}$ where the birth rate at state $k$ is $\alpha(k) = \lambda$ and death rate at
state $k$ is $\beta(k) = k$. This birth-death process has a $\mathrm{Po}(\lambda)$ stationary distribution so that the
theorem says that if there is a reversible Markov chain with stationary distribution equal to
the distribution of $W$ such that the chance of increasing by one is approximately proportional
to some constant $\lambda$ and the chance of decreasing by one is approximately proportional to
the current state, then $W$ will be approximately Poisson.

Proof. As usual, we want to bound $|E[\lambda f(W+1) - Wf(W)]|$ for functions $f$ such that
$\|f\| \le \min\{1, \lambda^{-1/2}\}$ and $\|\Delta f\| \le \min\{1, \lambda^{-1}\}$. Now, the function

$$F(w, w') = I[w' = w+1]f(w') - I[w' = w-1]f(w)$$

is anti-symmetric, so that $E[F(W, W')] = 0$. Moreover, by conditioning on $\mathcal{F}$ we obtain that
for all $c \in \mathbb{R}$,

$$cE\left[P(W' = W+1 \mid \mathcal{F})f(W+1) - P(W' = W-1 \mid \mathcal{F})f(W)\right] = 0,$$

which implies that

$$E[\lambda f(W+1) - Wf(W)] = E\left[(\lambda - cP(W' = W+1 \mid \mathcal{F}))f(W+1) - (W - cP(W' = W-1 \mid \mathcal{F}))f(W)\right]. \qquad (4.14)$$

Taking the absolute value and applying the triangle inequality yields the theorem. □

Example 4.20 (Law of small numbers). Let $W = \sum_{i=1}^n X_i$ where the $X_i$ are independent
indicators with $P(X_i = 1) = p_i$ and define $W' = W - X_I + X_I'$, where $I$ is uniform on
$\{1, \ldots, n\}$ independent of $W$ and $X_1', \ldots, X_n'$ are independent copies of the $X_i$ independent
of each other and all else. It is easy to see that $(W, W')$ is an exchangeable pair and that

$$P(W' = W+1 \mid (X_i)_{i \ge 1}) = \frac{1}{n}\sum_{i=1}^n (1 - X_i)p_i,$$

$$P(W' = W-1 \mid (X_i)_{i \ge 1}) = \frac{1}{n}\sum_{i=1}^n X_i(1 - p_i),$$

so that Theorem 4.20 with $c = n$ and $\lambda = \sum_{i=1}^n p_i$ yields

$$d_{TV}(W,Z) \le \min\{1, \lambda^{-1/2}\}\left(E\left|\sum_{i=1}^n p_i - \sum_{i=1}^n (1 - X_i)p_i\right| + E\left|\sum_{i=1}^n X_i - \sum_{i=1}^n X_i(1 - p_i)\right|\right)$$

$$= 2\min\{1, \lambda^{-1/2}\}E\left[\sum_{i=1}^n p_iX_i\right] = 2\min\{1, \lambda^{-1/2}\}\sum_{i=1}^n p_i^2.$$

This bound is not as good as that obtained using size-biasing (for example), but the better 
result can be recovered with exchangeable pairs by bounding the absolute value of (4.14) 
directly. 
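
To see how conservative the bound of Example 4.20 can be, one can compare it with the exact total variation distance for a small collection of indicators. The sketch below is only illustrative: the success probabilities in `ps` and the helper names are assumptions, and the distribution of $W$ is computed by direct convolution rather than by any method from the text.

```python
import math

def poisson_binomial_pmf(ps):
    """pmf of W = sum of independent Bernoulli(p_i), by sequential convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # indicator equals 0
            nxt[k + 1] += mass * p        # indicator equals 1
        pmf = nxt
    return pmf

def tv_to_poisson(pmf, lam):
    """Total variation distance between a pmf on {0, ..., n} and Po(lam)."""
    po = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(len(pmf))]
    tail = max(0.0, 1.0 - sum(po))        # Poisson mass beyond n
    return 0.5 * (sum(abs(a - b) for a, b in zip(pmf, po)) + tail)

ps = [0.05, 0.1, 0.02, 0.08, 0.15, 0.03]  # hypothetical success probabilities
lam = sum(ps)
tv = tv_to_poisson(poisson_binomial_pmf(ps), lam)
bound = 2 * min(1.0, lam ** -0.5) * sum(p * p for p in ps)
print(tv, bound)
```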

Example 4.21 (Fixed points of permutations). Let $\pi$ be a permutation of $\{1, \ldots, n\}$ chosen
uniformly at random, let $\tau$ be a uniformly chosen random transposition, and let $\pi' = \pi\tau$.
Let $W$ be the number of fixed points of $\pi$ and $W'$ be the number of fixed points of $\pi'$. Then
$(W, W')$ is exchangeable and if $W_2$ is the number of transpositions when $\pi$ is written as a
product of disjoint cycles, then

$$P(W' = W+1 \mid \pi) = \frac{n - W - 2W_2}{\binom{n}{2}},$$

$$P(W' = W-1 \mid \pi) = \frac{W(n - W)}{\binom{n}{2}}.$$

To see why these expressions are true, note that in order for the number of fixed points of a
permutation to increase by exactly one after multiplication by a transposition, a letter must
be fixed that is not already and is not in a transposition (else the number of fixed points
would increase by two). Similar considerations lead to the second expression.

By considering $W$ as a sum of indicators, it is easy to see that $E[W] = 1$ so that applying
Theorem 4.20 with $c = (n-1)/2$ yields, for $Z \sim \mathrm{Po}(1)$,

$$d_{TV}(W,Z) \le E\left|1 - \frac{n - W - 2W_2}{n}\right| + E\left|W - \frac{W(n - W)}{n}\right| = \frac{1}{n}E[W + 2W_2] + \frac{1}{n}E[W^2] = \frac{4}{n},$$

where the final equality follows by considering $W$ and $W_2$ as a sum of indicators which
leads to $E[W^2] = 2$ and $E[W_2] = 1/2$. As is well known, the true rate of convergence is
much better than order $1/n$; it is not clear how to get a better rate with this method.

We could also handle the number of i-cycles in a random permutation, but the analysis 
is a bit more tedious and is not worth pursuing due to the fact that so much more is known 
in this example - see [2] for a thorough account. 
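
For small $n$ the claims of Example 4.21 can be checked by brute force. The following sketch (the helper names are illustrative assumptions, and permutations act on $\{0, \ldots, n-1\}$ for convenience) enumerates all permutations, counts for each one the transpositions that change the number of fixed points by $+1$ or $-1$, and computes the exact total variation distance to $\mathrm{Po}(1)$.

```python
import math
from itertools import combinations, permutations

def fixed_points(p):
    return sum(1 for i, x in enumerate(p) if x == i)

def two_cycles(p):
    # count each 2-cycle {i, p[i]} once, through its smaller element
    return sum(1 for i, x in enumerate(p) if x > i and p[x] == i)

def check_transition_counts(n):
    """Compare brute-force counts with n - W - 2*W_2 and W*(n - W)."""
    for p in permutations(range(n)):
        w, w2 = fixed_points(p), two_cycles(p)
        up = down = 0
        for i, j in combinations(range(n), 2):
            q = list(p)
            q[i], q[j] = p[j], p[i]       # q = p composed with the transposition (i j)
            w_new = fixed_points(q)
            up += (w_new == w + 1)
            down += (w_new == w - 1)
        if up != n - w - 2 * w2 or down != w * (n - w):
            return False
    return True

def tv_fixed_points_vs_poisson(n):
    counts = [0] * (n + 1)
    for p in permutations(range(n)):
        counts[fixed_points(p)] += 1
    total = math.factorial(n)
    po = [math.exp(-1) / math.factorial(k) for k in range(n + 1)]
    tail = max(0.0, 1.0 - sum(po))        # Po(1) mass beyond n
    return 0.5 * (sum(abs(c / total - b) for c, b in zip(counts, po)) + tail)

print(check_transition_counts(5), tv_fixed_points_vs_poisson(6), 4 / 6)
```

Dividing the two counts by $\binom{n}{2}$, the number of transpositions, gives the conditional probabilities displayed in the example.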



5 Exponential approximation 

In this section we will develop Stein's method for bounding the Wasserstein distance (see
Section 1.1.1) between a distribution of interest and the exponential distribution. We will
move quickly through the material analogous to that of Section 2 for normal approximation,
as the general framework is similar. Our treatment follows [40] closely; alternative approaches
can be found in [21].

Definition 5.1. We say that a random variable $Z$ has the exponential distribution with
rate $\lambda$, denoted $Z \sim \mathrm{Exp}(\lambda)$, if $Z$ has density $\lambda e^{-\lambda x}$ for $x > 0$.

Lemma 5.1. Define the functional operator $\mathcal{A}$ by

$$\mathcal{A}f(x) = f'(x) - f(x) + f(0).$$

1. If $Z \sim \mathrm{Exp}(1)$, then $E\mathcal{A}f(Z) = 0$ for all absolutely continuous $f$ with $E|f'(Z)| < \infty$.

2. If for some non-negative random variable $W$, $E\mathcal{A}f(W) = 0$ for all absolutely
continuous $f$ with $E|f'(Z)| < \infty$, then $W \sim \mathrm{Exp}(1)$.

The operator $\mathcal{A}$ is referred to as a characterizing operator of the exponential distribution.

Before proving the lemma, we state one more result and then its consequence. 

Lemma 5.2. Let $Z \sim \mathrm{Exp}(1)$ and for some function $h$ let $f_h$ be the unique solution of

$$f_h'(x) - f_h(x) = h(x) - E[h(Z)] \qquad (5.1)$$

such that $f_h(0) = 0$.

1. If $h$ is non-negative and bounded, then

$$\|f_h\| \le \|h\| \quad\text{and}\quad \|f_h'\| \le 2\|h\|.$$

2. If $h$ is absolutely continuous, then

$$\|f_h'\| \le \|h'\| \quad\text{and}\quad \|f_h''\| \le 2\|h'\|.$$

Analogous to normal approximation, this setup immediately yields the following
promising result.

Theorem 5.3. Let $W \ge 0$ be a random variable with finite mean and $Z \sim \mathrm{Exp}(1)$.

1. If $\mathcal{F}_W$ is the set of functions with $\|f'\| \le 1$, $\|f''\| \le 2$, and $f(0) = 0$, then

$$d_W(W,Z) \le \sup_{f \in \mathcal{F}_W} |E[f'(W) - f(W)]|.$$

2. If $\mathcal{F}_K$ is the set of functions with $\|f\| \le 1$, $\|f'\| \le 2$, and $f(0) = 0$, then

$$d_K(W,Z) \le \sup_{f \in \mathcal{F}_K} |E[f'(W) - f(W)]|.$$

Proof of Lemma 5.2. Writing $\tilde{h}(t) := h(t) - E[h(Z)]$, the relation (5.1) can easily be solved
to yield

$$f_h(x) = -e^x\int_x^\infty \tilde{h}(t)e^{-t}\,dt. \qquad (5.2)$$

1. If $h$ is non-negative and bounded, then (5.2) implies that

$$|f_h(x)| \le e^x\int_x^\infty |\tilde{h}(t)|e^{-t}\,dt \le \|h\|.$$

Since $f_h$ solves (5.1), we have

$$|f_h'(x)| = |f_h(x) + \tilde{h}(x)| \le \|f_h\| + \|\tilde{h}\| \le 2\|h\|,$$

where we have used the bound on $\|f_h\|$ above and that $h$ is non-negative.

2. If $h$ is absolutely continuous, then by the form of (5.2) it is clear that $f_h$ is twice
differentiable. Thus we have that $f_h$ satisfies

$$f_h''(x) - f_h'(x) = h'(x),$$

so that we can use the arguments of the proof of the previous item to establish the
bounds on $\|f_h'\|$ and $\|f_h''\|$.

□

Proof of Lemma 5.1. Item 1 essentially follows by integration by parts. More formally, for
$f$ absolutely continuous, we have

$$E[f'(Z)] = \int_0^\infty f'(t)e^{-t}\,dt = \int_0^\infty f'(t)\int_t^\infty e^{-x}\,dx\,dt = \int_0^\infty e^{-x}\int_0^x f'(t)\,dt\,dx = E[f(Z)] - f(0),$$

as desired. For the second item, assume that $W \ge 0$ satisfies $E[f'(W)] = E[f(W)] - f(0)$ for
all absolutely continuous $f$ with $E|f'(Z)| < \infty$. The functions $f(x) = x^k$ are in this family,
so that

$$kE[W^{k-1}] = E[W^k],$$

and this relation determines the moments of $W$ as those of an exponential distribution with
rate one, which satisfy Carleman's condition (using Stirling's approximation). Alternatively,
the hypothesis on $W$ also determines the Laplace transform as that of an exponential variable.

□

It is clear from the form of the error in Theorem 5.3 that we want to find some structure
in $W$, the random variable of interest, that allows us to compare $E[f(W)]$ to $E[f'(W)]$
for appropriate $f$. Unfortunately the tools we have previously developed for the analogous
task in Poisson and normal approximation will not help us directly here. However, we will
be able to define a transformation amenable to the form of the exponential characterizing
operator which will prove fruitful. An alternative approach (followed in [21, 22]) is to use
exchangeable pairs with a modified $a$-Stein condition.

5.1 Equilibrium coupling 

We begin with a definition. 

Definition 5.2. Let $W \ge 0$ be a random variable with $E[W] = \mu$. We say that $W^e$ has the
equilibrium distribution with respect to $W$ if

$$E[f(W)] - f(0) = \mu E[f'(W^e)] \qquad (5.3)$$

for all Lipschitz functions $f$.

We will see below that the equilibrium distribution is absolutely continuous with re- 
spect to Lebesgue measure, so that the right hand side of (5.3) is well defined. Before this 
discussion, we note the following consequence of this definition. 

Theorem 5.4. Let $W \ge 0$ be a random variable with $E[W] = 1$ and $E[W^2] < \infty$. If $W^e$ has
the equilibrium distribution with respect to $W$ and is coupled to $W$, then for $Z \sim \mathrm{Exp}(1)$,

$$d_W(W,Z) \le 2E|W^e - W|.$$

Proof. Note that if $f(0) = 0$ and $E[W] = 1$, then $E[f(W)] = E[f'(W^e)]$ so that for $f$ with
bounded first and second derivative and such that $f(0) = 0$, we have

$$|E[f'(W) - f(W)]| = |E[f'(W) - f'(W^e)]| \le \|f''\|E|W^e - W|.$$

Applying Theorem 5.3 now proves the desired conclusion. □

We now state a constructive definition of the equilibrium distribution which will also be 
useful later. 

Proposition 5.5. Let $W \ge 0$ be a random variable with $E[W] = \mu$ and let $W^s$ have the
size-bias distribution of $W$. If $U$ is uniform on the interval $(0,1)$ and independent of $W^s$,
then $UW^s$ has the equilibrium distribution of $W$.

Proof. Let $f$ be Lipschitz with $f(0) = 0$. Then

$$E[f'(UW^s)] = E\left[\int_0^1 f'(uW^s)\,du\right] = E\left[\frac{f(W^s)}{W^s}\right] = \mu^{-1}E[f(W)],$$

where in the final equality we use the definition of the size-bias distribution. □

Remark 5.3. This proposition shows that the equilibrium distribution is the same as that
from renewal theory. That is, a stationary renewal process with increments distributed as
a random variable $Y$ is given by $Y^e + Y_1 + \cdots + Y_n$, where $Y^e$ has the equilibrium distribution
of $Y$ and is independent of the i.i.d. sequence $Y_1, Y_2, \ldots$

We will use Theorem 5.4 to treat some less trivial applications shortly, but first we handle 
a canonical exponential approximation result. 

Example 5.4 (Geometric distribution). Let $N$ be geometric with parameter $p$ with positive
support (denoted $N \sim \mathrm{Ge}(p)$), specifically, $P(N = k) = (1-p)^{k-1}p$ for $k \ge 1$. It is well
known that as $p \to 0$, $pN$ converges weakly to an exponential distribution; this fact is not
surprising as a simple calculation shows that if $Z \sim \mathrm{Exp}(\lambda)$, then the smallest integer no
less than $Z$ is geometrically distributed. We can use Theorem 5.4 above to obtain an
error in this approximation.

A little calculation shows that $N$ has the same distribution as a variable which is uniform
on $\{1, \ldots, N^s\}$, where $N^s$ has the size-bias distribution of $N$ (heuristically this is due to the
memoryless property of the geometric distribution - see Remark 5.3). Thus Proposition 5.5
implies that for $U$ uniform on $(0,1)$ independent of $N$, $N - U$ has the equilibrium distribution
of $N$.

It is easy to verify that for a constant $c$ and a non-negative variable $X$, $(cX)^e = cX^e$, so
that if we define $W = pN$ our remarks above imply that $W^e := W - pU$ has the equilibrium
distribution with respect to $W$. We now apply Theorem 5.4 to find that for $Z \sim \mathrm{Exp}(1)$,

$$d_W(W,Z) \le 2E[pU] = p.$$
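
The bound of Example 5.4 can be compared with the exact Wasserstein distance, using the standard fact that on the real line $d_W$ equals the integral of the absolute difference of the two survival functions. The sketch below is illustrative only: the function name and the truncation point `x_max` are assumptions, and the neglected tail is of order $e^{-x_{\max}}$.

```python
import math

def wasserstein_scaled_geometric(p: float, x_max: float = 50.0) -> float:
    """Approximate d_W(pN, Z) for N ~ Ge(p) on {1, 2, ...} and Z ~ Exp(1).

    On [jp, (j+1)p) the survival function of pN is the constant a = (1-p)^j,
    so each piece of the integral of |exp(-x) - a| is computed in closed form,
    splitting the piece at the point where exp(-x) = a when it lies inside.
    """
    def signed(a, lo, hi):
        # integral of (exp(-x) - a) over [lo, hi]
        return (math.exp(-lo) - math.exp(-hi)) - a * (hi - lo)

    total, j = 0.0, 0
    while j * p < x_max:
        lo, hi, a = j * p, (j + 1) * p, (1 - p) ** j
        cross = -math.log(a) if a > 0 else math.inf
        if lo < cross < hi:
            total += abs(signed(a, lo, cross)) + abs(signed(a, cross, hi))
        else:
            total += abs(signed(a, lo, hi))
        j += 1
    return total

for p in (0.5, 0.1, 0.01):
    print(p, wasserstein_scaled_geometric(p))
```
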
5.2 Application: Geometric sums 

Our first application is a generalization of the following classical result which in turn gener- 
alizes Example 5.4. 

Theorem 5.6 (Rényi's Theorem). Let $X_1, X_2, \ldots$ be an i.i.d. sequence of non-negative
random variables with $E[X_i] = 1$ and let $N \sim \mathrm{Ge}(p)$ independent of the $X_i$. If $W = p\sum_{i=1}^N X_i$
and $Z \sim \mathrm{Exp}(1)$, then

$$\lim_{p \to 0} d_K(W,Z) = 0.$$

The special case where $X_i \equiv 1$ is handled in Example 5.4; intuitively, the example can
be generalized because for $p$ small, $N$ is large so that the law of large numbers implies
that $\sum_{i=1}^N X_i$ is approximately equal to $N$. We will show the following result which implies
Rényi's Theorem.

Theorem 5.7. Let $X_1, X_2, \ldots$ be square integrable, non-negative, and independent random
variables with $E[X_i] = 1$. Let $N > 0$ be an integer valued random variable with $E[N] = 1/p$
for some $0 < p \le 1$ and let $M$ be defined on the same space as $N$ such that

$$P(M = m) = pP(N \ge m).$$

If $W = p\sum_{i=1}^N X_i$, $Z \sim \mathrm{Exp}(1)$, and $X_i^e$ is an equilibrium coupling of $X_i$ independent of
$N$, $M$, and $(X_j)_{j \ne i}$, then

$$d_W(W,Z) \le 2p\left(E|X_M - X_M^e| + E|N - M|\right) \qquad (5.4)$$

$$\le 2p\left(1 + \frac{\mu_2}{2} + E|N - M|\right), \qquad (5.5)$$

where $\mu_2 := \sup_i E[X_i^2]$.

Before the proof of the theorem, we make a few remarks.

Remark 5.5. The theorem can be a little difficult to parse on first glance, so we make a
few comments to interpret the error. The random variable $M$ is a discrete version of the
equilibrium transform which we have already seen above in Example 5.4. More specifically,
it is easy to verify that if $N^s$ has the size-bias distribution of $N$, then $M$ is distributed
uniformly on the set $\{1, \ldots, N^s\}$. If $N \sim \mathrm{Ge}(p)$, then we can take $M = N$ so that the final
term of the error of (5.4) and (5.5) is zero. Thus $E|N - M|$ quantifies the proximity of the
distribution of $N$ to a geometric distribution. We will formalize this more precisely when we
cover Stein's method for geometric approximation below.

The first term of the error in (5.5) can be interpreted as a term measuring the regularity of
the $X_i$. The heuristic is that the law of large numbers needs to kick in so that $\sum_{i=1}^N X_i \approx N$,
and the theorem shows that in order for this to occur, it is enough that the $X_i$'s have
uniformly bounded variances. Moreover, the first term of the error (5.4) shows that the
approximation also benefits from having the $X_i$ be close to exponentially distributed. In
particular, if all of the $X_i$ are exponential and $N$ is geometric, then the theorem shows
$d_W(W,Z) = 0$ which can also be easily verified using Laplace transforms.

Remark 5.6. A more general theorem can be proved with a little added technicality which
allows for the $X_i$ to have different means and for the $X_i$ to have a certain dependence -
see [40].

Proof. We will show that

$$W^e := p\left[\sum_{i=1}^{M-1} X_i + X_M^e\right] \qquad (5.6)$$

is an equilibrium coupling of $W$. From this point Theorem 5.4 implies that

$$d_W(W,Z) \le 2pE\left|X_M^e - X_M + \mathrm{sgn}(M - N)\sum_{i = M \wedge N + 1}^{M \vee N} X_i\right| \le 2pE\left[|X_M^e - X_M| + |N - M|\right],$$

which proves (5.4). The second bound (5.5) follows from (5.4) after noting that

$$E[|X_M^e - X_M| \mid M] \le E[X_M^e \mid M] + E[X_M \mid M] = \frac{1}{2}E[X_M^2 \mid M] + 1 \le \frac{\mu_2}{2} + 1,$$

where the equality follows from the definition of the equilibrium coupling.

It only remains to show (5.6). Let $f$ be Lipschitz with $f(0) = 0$ and define

$$g(m) = f\left(p\sum_{i=1}^m X_i\right).$$

On one hand, using independence and the defining relation of $X_M^e$, we obtain

$$E\left[f'\left(p\sum_{i=1}^{M-1} X_i + pX_M^e\right) \,\Big|\, M\right] = p^{-1}E[g(M) - g(M-1) \mid M],$$

and on the other, the definition of $M$ implies

$$p^{-1}E[g(M) - g(M-1) \mid (X_i)_{i \ge 1}] = E[g(N) \mid (X_i)_{i \ge 1}],$$

so that altogether we obtain $E[f'(W^e)] = E[g(N)] = E[f(W)]$, as desired. □
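
The term $E|N - M|$ in Theorem 5.7 depends on how $N$ and $M$ are coupled. As a rough illustration (not taken from the source), the sketch below computes the law of $M$ from that of $N$ and then the smallest value of $E|N - M|$ over all couplings, using the standard fact that on the integers this minimum equals the sum of the absolute differences of the two distribution functions. The function names and the example law of $N$ (uniform on $\{1, \ldots, 5\}$) are assumptions made for the illustration.

```python
from fractions import Fraction

def law_of_M(pmf_N):
    """Given pmf_N[m-1] = P(N = m) for m = 1..len(pmf_N), return P(M = m) = p * P(N >= m)."""
    mean = sum(m * q for m, q in enumerate(pmf_N, start=1))
    p = 1 / mean                      # E[N] = 1/p
    tail, out = sum(pmf_N), []
    for q in pmf_N:
        out.append(p * tail)          # p * P(N >= m)
        tail -= q
    return out

def best_coupling_cost(pmf_a, pmf_b):
    """Minimum of E|A - B| over couplings: sum over m of |P(A <= m) - P(B <= m)|."""
    cost, fa, fb = 0, 0, 0
    for a, b in zip(pmf_a, pmf_b):
        fa, fb = fa + a, fb + b
        cost += abs(fa - fb)
    return cost

pmf_N = [Fraction(1, 5)] * 5          # hypothetical N, uniform on {1, ..., 5}
pmf_M = law_of_M(pmf_N)
print(pmf_M, best_coupling_cost(pmf_N, pmf_M))
```

For $N \sim \mathrm{Ge}(p)$ one has $P(N \ge m) = (1-p)^{m-1}$, so $M$ has the same law as $N$ and the cost is zero, in line with Remark 5.5.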

5.3 Application: Critical branching process 

In this section we obtain an error in a classical theorem of Yaglom pertaining to the generation
size of a critical Galton-Watson branching process conditioned on non-extinction. We will
attempt to have this section be self-contained, but it will be helpful to have been exposed to
the elementary properties and definitions of branching processes, found for example in the
first chapter of [7].

Let $Z_0 = 1$, and $Z_1$ be a non-negative integer valued random variable with finite mean.
For $i, j \ge 1$, let $Z_{i,j}$ be i.i.d. copies of $Z_1$ and define for $n \ge 1$

$$Z_{n+1} = \sum_{i=1}^{Z_n} Z_{n,i}.$$

We think of $Z_n$ as the generation size of a population that initially has one individual
and where each individual in a generation has a $Z_1$ distributed number of offspring (or
children) independently of the other individuals in the generation. We also assume that all
individuals in a generation have offspring at the same time (creating the next generation)
and die immediately after reproducing.

It is a basic fact [7] that if $E[Z_1] \le 1$ and $P(Z_1 = 1) < 1$ then the population almost
surely dies out, whereas if $E[Z_1] > 1$, then the probability the population lives forever is
strictly positive. Thus the case where $E[Z_1] = 1$ is referred to as the critical case and a
fundamental result of the behavior in this case is the following.

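Theorem 5.8 (Yaglom's Theorem). Let $1 = Z_0, Z_1, \ldots$ be the generation sizes of a
Galton-Watson branching process where $E[Z_1] = 1$ and $\mathrm{Var}(Z_1) = \sigma^2 < \infty$. If $Y_n \stackrel{d}{=} (Z_n \mid Z_n > 0)$
and $Z \sim \mathrm{Exp}(1)$, then

$$\lim_{n \to \infty} d_K\left(\frac{2Y_n}{\sigma^2 n}, Z\right) = 0.$$

We will provide a rate of convergence in this theorem under a stricter moment assumption.

Theorem 5.9. Let $Z_0, Z_1, \ldots$ be as in Yaglom's Theorem above and assume also that $E|Z_1|^3 < \infty$.
If $Y_n \stackrel{d}{=} (Z_n \mid Z_n > 0)$ and $Z \sim \mathrm{Exp}(1)$, then for some constant $C$,

$$d_W\left(\frac{2Y_n}{\sigma^2 n}, Z\right) \le C\,\frac{\log(n)}{n}.$$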



Proof. We will construct a copy of $Y_n$ and $Y_n^e$ having the equilibrium distribution on the
same space and then show that $E|Y_n - Y_n^e| \le C\log(n)$. Once this is established, the result
is proved by Theorem 5.4 and the fact that $(cY_n)^e = cY_n^e$ for any constant $c$.

In order to couple $Y_n$ and $Y_n^e$, we will construct a "size-bias" tree and then find copies of
the variables we need in it. The clever construction we will use is due to [38] and implicit in
their work is the fact that $E|Y_n - Y_n^e|/n \to 0$ (used to show that $nP(Z_n > 0) \to 2/\sigma^2$), but
we will extract a rate from their analysis.

We view the size-bias tree as labeled and ordered, in the sense that, if $w$ and $v$ are vertices
in the tree from the same generation and $w$ is to the left of $v$, then the offspring of $w$ is to the
left of the offspring of $v$. Start in generation 0 with one vertex $v_0$ and let it have a number of
offspring distributed according to the size-bias distribution of $Z_1$. Pick one of the offspring
of $v_0$ uniformly at random and call it $v_1$. To each of the siblings of $v_1$ attach an independent
Galton-Watson branching process having the offspring distribution of $Z_1$. For $v_1$ proceed as
for $v_0$, i.e., give it a size-bias number of offspring, pick one uniformly at random, call it $v_2$,
attach independent Galton-Watson branching processes to the siblings of $v_2$ and so on. It is
clear that this process will always give an infinite tree as the "spine" $v_0, v_1, v_2, \ldots$ will be
infinite. See Figure 1 of [38] for an illustration of this tree.

Now, for a fixed tree $t$, let $G_n(t)$ be the chance that the original branching process driven
by $Z_1$ agrees with $t$ up to generation $n$, let $G_n^s(t)$ be the chance that the size-bias tree just
described agrees with $t$ up to generation $n$, and for $v$ an individual of $t$ in generation $n$, let
$G_n^s(t,v)$ be the chance that the size-bias tree agrees with $t$ up to generation $n$ and has the
vertex $v$ as the distinguished vertex $v_n$ in generation $n$. We claim that

$$G_n^s(t,v) = G_n(t). \qquad (5.7)$$

Before proving this claim, we note some immediate consequences which imply that our
size-bias tree naturally contains a copy of $Y_n^e$. Let $S_n$ be the size of generation $n$ in the size-bias
tree.

1. $S_n$ has the size-bias distribution of $Z_n$.

2. If $Y_n^s$ has the size-bias distribution of $Y_n$, then $S_n \stackrel{d}{=} Y_n^s$.

3. Given $S_n$, $v_n$ is uniformly distributed among the individuals of generation $n$.

4. If $R_n$ is the number of individuals to the right (inclusive) of $v_n$ and $U$ is uniform on
$(0,1)$, independent of all else, then $Y_n^e := R_n - U$ has the equilibrium distribution of
$Y_n$.

To show the first item, note that (5.7) implies

$$G_n^s(t) = t_nG_n(t), \qquad (5.8)$$

where $t_n$ is the number of individuals in the $n$th generation of $t$. Now, $P(S_n = k)$ is obtained
by integrating the left hand side of (5.8) over trees $t$ with $t_n = k$, and performing the same
integral on the right hand side of (5.8) yields $kP(Z_n = k)$. The second item follows from
the more general fact that conditioning a non-negative random variable to be positive does
not change the size-bias distribution. Item 3 can be read from the right hand side of (5.7),
since it does not depend on $v$. For Item 4, Item 3 implies that $R_n$ is uniform on $\{1, \ldots, S_n\}$
so that $R_n - U \stackrel{d}{=} US_n$, from which the result follows from Item 2 and Proposition 5.5.

At this point, we would like to find a copy of $Y_n$ in the size-bias tree, but before proceeding
further we prove (5.7). Since trees formed below distinct vertices in a given generation are
independent, we will prove the formula by writing down a recursion. To this end, for a
given planar tree $t$ with $k$ individuals in the first generation, label the subtrees with these $k$
individuals as a root from left to right by $t_1, t_2, \ldots, t_k$. Now, a vertex $v$ in generation $n+1$
of $t$ lies in exactly one of the subtrees $t_1, \ldots, t_k$, say $t_i$. With this setup, we have

$$G_{n+1}^s(t,v) = G_n^s(t_i,v)\prod_{j \ne i} G_n(t_j)\,[kP(Z_1 = k)]\,\frac{1}{k}.$$

The first factor corresponds to the chance of seeing the tree $t_i$ up to generation $n+1$ below the
distinguished vertex $v_1$ and choosing $v$ as the distinguished vertex in generation $n+1$. The
second factor is the chance of seeing the remaining subtrees up to generation $n+1$, and the
remaining factors correspond to having $k$ offspring initially (with the size-bias distribution
of $Z_1$) and choosing vertex $v_1$ (the root of $t_i$) as the distinguished vertex initially. With this
formula in hand, it is enough to verify that (5.7) follows this recursion.

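We must now find a copy of $Y_n$ in our size-bias tree. If $L_n$ is the number of individuals
to the left of $v_n$ (exclusive, so $S_n = L_n + R_n$), then we claim that

$$S_n \mid \{L_n = 0\} \stackrel{d}{=} Y_n. \qquad (5.9)$$

Indeed, we have

$$P(S_n = k \mid L_n = 0) = \frac{P(L_n = 0 \mid S_n = k)P(S_n = k)}{P(L_n = 0)} = \frac{P(S_n = k)}{kP(L_n = 0)} = \frac{P(Z_n = k)}{P(L_n = 0)},$$

where we have used Items 2 and 3 above and the claim now follows since

$$P(L_n = 0) = \sum_{k \ge 1} P(L_n = 0 \mid S_n = k)P(S_n = k) = \sum_{k \ge 1} P(Z_n = k) = P(Z_n > 0).$$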

We are only part of the way to finding a copy of $Y_n$ in the size-bias tree since we still need
to realize $S_n$ given the event $L_n = 0$. Denote by $S_{n,j}$ the number of particles in generation $n$
that stem from any of the siblings of $v_j$ (but not $v_j$ itself). Clearly, $S_n = 1 + \sum_{j=1}^n S_{n,j}$, where
the summands are independent. Likewise, let $L_{n,j}$ and $R_{n,j}$ be the number of particles in
generation $n$ that stem from the siblings to the left and right of $v_j$ (exclusive) and note that
$L_{n,n}$ and $R_{n,n}$ are just the number of siblings of $v_n$ to the left and to the right, respectively.
We have the relations $L_n = \sum_{j=1}^n L_{n,j}$ and $R_n = 1 + \sum_{j=1}^n R_{n,j}$. Note that for fixed $j$, $L_{n,j}$
and $R_{n,j}$ are in general not independent, as they are linked through the offspring size of
$v_{j-1}$.

Let now $R'_{n,j}$ be independent random variables such that

$$R'_{n,j} \stackrel{d}{=} R_{n,j} \mid \{L_{n,j} = 0\},$$

and

$$R^*_{n,j} = R_{n,j}I[L_{n,j} = 0] + R'_{n,j}I[L_{n,j} > 0] = R_{n,j} + (R'_{n,j} - R_{n,j})I[L_{n,j} > 0].$$

Finally, if $R_n^* = 1 + \sum_{j=1}^n R^*_{n,j}$, then (5.9) implies that we can take $Y_n := R_n^*$.

Having coupled $Y_n$ and $Y_n^e$, we can now proceed to show $E|Y_n^e - Y_n| = O(\log(n))$. By
Item 4 above,

$$|Y_n - Y_n^e| = \left|1 - U + \sum_{j=1}^n (R'_{n,j} - R_{n,j})I[L_{n,j} > 0]\right| \le |1 - U| + \sum_{j=1}^n R'_{n,j}I[L_{n,j} > 0] + \sum_{j=1}^n R_{n,j}I[L_{n,j} > 0].$$

Taking expectation in the inequality above, our result will follow after we show that

(i) $E[R'_{n,j}I[L_{n,j} > 0]] \le \sigma^2P(L_{n,j} > 0)$,

(ii) $E[R_{n,j}I[L_{n,j} > 0]] \le E[Z_1^3]P(Z_{n-j} > 0)$,

(iii) $P(L_{n,j} > 0) \le \sigma^2P(Z_{n-j} > 0) \le C(n - j + 1)^{-1}$ for some $C > 0$.

For part (i), independence implies that

$$E[R'_{n,j}I[L_{n,j} > 0]] = E[R'_{n,j}]P(L_{n,j} > 0),$$

and using that $S_{n,j}$ and $I[L_{n,j} = 0]$ are negatively correlated (in the second inequality) below,
we find

$$E[R'_{n,j}] = E[R_{n,j} \mid L_{n,j} = 0] \le E[S_{n,j} \mid L_{n,j} = 0] \le E[S_{n,j}] \le \sigma^2.$$

For part (ii), if $X_j$ denotes the number of siblings of $v_j$, having the size-bias distribution
of $Z_1$ minus 1, we have

$$E[R_{n,j}I[L_{n,j} > 0]] \le E[X_jI[L_{n,j} > 0]]$$
$$\le \sum_k kP(X_j = k, L_{n,j} > 0)$$
$$\le \sum_k kP(X_j = k)P(L_{n,j} > 0 \mid X_j = k)$$
$$\le \sum_k k^2P(X_j = k)P(Z_{n-j} > 0)$$
$$\le E[Z_1^3]P(Z_{n-j} > 0),$$

where we have used that $E[R_{n,j}I[L_{n,j} > 0] \mid X_j] \le X_jI[L_{n,j} > 0]$ in the first inequality and
that $P(L_{n,j} > 0 \mid X_j = k) \le kP(Z_{n-j} > 0)$ in the penultimate inequality.
Finally, we have

$$P(L_{n,j} > 0) = E[P(L_{n,j} > 0 \mid X_j)] \le E[X_jP(Z_{n-j} > 0)] \le \sigma^2P(Z_{n-j} > 0).$$

Using Kolmogorov's estimate (see Chapter 1, Section 9 of [7]), we have
$\lim_{n \to \infty} nP(Z_n > 0) = 2/\sigma^2$, which implies the final statement of (iii). □
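
Kolmogorov's estimate is easy to observe numerically in a special case. The sketch below is an illustration under an assumed offspring law, $P(Z_1 = k) = 2^{-(k+1)}$ for $k \ge 0$, which has mean 1, variance $\sigma^2 = 2$ and probability generating function $s \mapsto 1/(2 - s)$; this choice and the function name are not taken from the source. The extinction probabilities satisfy $P(Z_{m+1} = 0) = f(P(Z_m = 0))$ for the generating function $f$, which is the only fact used.

```python
def survival_probability(n: int) -> float:
    """P(Z_n > 0) when P(Z_1 = k) = 2^{-(k+1)}, k = 0, 1, 2, ... (assumed offspring law)."""
    s = 0.0                      # P(Z_0 = 0), since Z_0 = 1
    for _ in range(n):
        s = 1.0 / (2.0 - s)      # apply the offspring generating function once
    return 1.0 - s

sigma2 = 2.0                     # variance of the assumed offspring law
for n in (10, 100, 1000):
    print(n, n * survival_probability(n), 2 / sigma2)
```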

6 Geometric approximation 

Due to Example 5.4, if $W > 0$ is integer-valued such that $W/E[W]$ is approximately
exponential and $E[W]$ is large, then we expect that $W$ will be approximately geometrically
distributed. In fact, if we write $E[W] = 1/p$, and let $X \sim \mathrm{Ge}(p)$ and $Z \sim \mathrm{Exp}(1)$, then the
triangle inequality implies that

$$|d_W(pW, Z) - d_W(pX, pW)| \le p.$$

However, if we want to bound $d_{TV}(W, X)$, then the inequality above is not useful. For
example, if $W = kX$ for some positive integer $k$, then $d_{TV}(W, X) \ge (k-1)/k$ since the
support of $W$ and $X$ do not match. This issue of support mismatch is typical in bounding
the total variation distance between integer-valued random variables and can be handled by
introducing a term into the bound that quantifies the "smoothness" of the random variable
of interest.

The version of Stein's method for geometric approximation which we will discuss below 
can be used to handle these types of technicalities [41], but the arguments can be a bit 
tedious. Thus, we will develop a simplified version of the method and apply it to an example 
where these technicalities do not arise and where exponential approximation does not hold. 

We parallel the development of Stein's method for exponential approximation above, so 
we will move quickly through the initial theoretical framework; our work below follows [41]. 

6.1 Main theorem 

A typical issue when discussing the geometric distribution is whether to have the support
begin at zero or one. In our work below we will focus on the geometric distribution which
puts mass at zero; that is $N \sim \mathrm{Ge}^0(p)$ if for $k = 0, 1, \ldots$, we have $P(N = k) = (1-p)^kp$.
Developing the theory below for the geometric distribution with positive support is similar
in flavor, but different in detail - see [41].

As usual we begin by defining the characterizing operator that we will use. 

Lemma 6.1. Define the functional operator $\mathcal{A}$ by

$$\mathcal{A}f(k) = (1-p)\Delta f(k) - pf(k) + pf(0).$$

1. If $Z \sim \mathrm{Ge}^0(p)$, then $E\mathcal{A}f(Z) = 0$ for all bounded $f$.

2. If for some non-negative random variable $W$, $E\mathcal{A}f(W) = 0$ for all bounded $f$, then
$W \sim \mathrm{Ge}^0(p)$.

The operator $\mathcal{A}$ is referred to as a characterizing operator of the geometric distribution.

We now state the properties of the solution to the Stein equation that we need.

Lemma 6.2. If $Z \sim \mathrm{Ge}^0(p)$, $A \subseteq \mathbb{N} \cup \{0\}$, and $f_A$ is the unique solution with $f_A(0) = 0$ of

$$(1-p)\Delta f_A(k) - pf_A(k) = I[k \in A] - P(Z \in A),$$

then $-1 \le \Delta f_A(k) \le 1$.

These two lemmas lead easily to the following result.

Theorem 6.3. Let $\mathcal{F}$ be the set of functions with $f(0) = 0$ and $\|\Delta f\| \le 1$ and let $W \ge 0$ be
an integer-valued random variable with $E[W] = (1-p)/p$ for some $0 < p \le 1$. If $N \sim \mathrm{Ge}^0(p)$,
then

$$d_{TV}(W, N) \le \sup_{f \in \mathcal{F}} |E[(1-p)\Delta f(W) - pf(W)]|.$$

Before proceeding further, we briefly indicate the proofs of Lemmas 6.1 and 6.2. 

Proof of Lemma 6.1. The first assertion is a simple computation while the second can be 
verified by choosing f{k) = I[k = j] for each j = 0, 1, . . . which yields a recursion for the 
point probabilities for W. □ 

Proof of Lemma 6.2. After noting that

$$f_A(k) = \sum_{i \in A} (1-p)^i - \sum_{i \in A,\, i \ge k} (1-p)^{i-k},$$

we easily see

$$\Delta f_A(k) = I[k \in A] - p\sum_{i \in A,\, i \ge k+1} (1-p)^{i-k-1},$$

which is the difference of two non-negative terms, each of which is bounded above by one. □
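
The explicit solution in the proof of Lemma 6.2 can be checked numerically against the Stein equation. The sketch below is illustrative only: it restricts to a finite set $A$ so that the sums are finite, and the function names, the set $A$ and the value of $p$ are assumptions.

```python
def stein_solution(A, p, k):
    """f_A(k) from the displayed formula, for a finite set A of non-negative integers."""
    q = 1 - p
    return sum(q ** i for i in A) - sum(q ** (i - k) for i in A if i >= k)

def check_stein_equation(A, p, k_max=30, tol=1e-12):
    """Check (1-p)*Delta f_A(k) - p*f_A(k) = I[k in A] - P(Z in A) and |Delta f_A(k)| <= 1."""
    q = 1 - p
    prob_A = sum(p * q ** i for i in A)             # P(Z in A) for Z ~ Ge^0(p)
    for k in range(k_max):
        delta = stein_solution(A, p, k + 1) - stein_solution(A, p, k)
        lhs = q * delta - p * stein_solution(A, p, k)
        rhs = (1.0 if k in A else 0.0) - prob_A
        if abs(lhs - rhs) > tol or abs(delta) > 1 + tol:
            return False
    return True

print(check_stein_equation({0, 3, 4, 10}, 0.3))
```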

It is clear from the form of the error of Theorem 6.3 that it may be fruitful to attempt to 
define a discrete version of the equilibrium distribution used in the exponential approximation 
formulation above, which is the program we will follow. An alternative coupling is used 
in [39]. 
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6.2 Discrete equilibrium coupling 

We begin with a definition. 

Definition 6.1. Let $W \ge 0$ be a random variable with $\mathbb{E}[W] = (1-p)/p$ for some $0 < p \le 1$.
We say that $W^e$ has the discrete equilibrium distribution with respect to $W$ if for all functions
$f$ with $\|\Delta f\| < \infty$,

\[ p\mathbb{E}[f(W)] - pf(0) = (1-p)\mathbb{E}[\Delta f(W^e)]. \qquad (6.1) \]

The following result provides a constructive definition of the discrete equilibrium distri- 
bution, so that the right hand side of (6.1) defines a probability distribution. 

Proposition 6.4. Let $W \ge 0$ be an integer-valued random variable with $\mathbb{E}[W] = (1-p)/p$
for some $0 < p \le 1$ and let $W^s$ have the size-bias distribution of $W$. If conditional on $W^s$,
$W^e$ is uniform on $\{0, 1, \ldots, W^s - 1\}$, then $W^e$ has the discrete equilibrium distribution with
respect to $W$.

Proof. Let $f$ be such that $\|\Delta f\| < \infty$ and $f(0) = 0$. If $W^e$ is uniform on $\{0, 1, \ldots, W^s - 1\}$
as dictated by the proposition, then

\[ \mathbb{E}[\Delta f(W^e)] = \mathbb{E}\left[ \frac{1}{W^s} \sum_{i=0}^{W^s - 1} \Delta f(i) \right] = \mathbb{E}\left[ \frac{f(W^s)}{W^s} \right] = \frac{p}{1-p} \mathbb{E}[f(W)], \]

where in the final equality we use the definition of the size-bias distribution. □

Remark 6.2. This proposition shows that the equilibrium distribution is the same as that 
from renewal theory - see Remark 5.3. 
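The construction in Proposition 6.4 is easy to carry out exactly for a small example. The sketch below uses exact rational arithmetic; the choice $W \sim \mathrm{Binomial}(4, 1/2)$ (so that $\mathbb{E}[W] = 2$ and $p = 1/3$), the test function, and the function names are assumptions made for illustration. It builds the law of $W^e$ by size-biasing and then sampling uniformly, and confirms identity (6.1).

```python
from fractions import Fraction

def size_bias(pmf):
    """pmf of W^s: P(W^s = w) = w * P(W = w) / E[W]."""
    mean = sum(w * q for w, q in enumerate(pmf))
    return [w * q / mean for w, q in enumerate(pmf)]

def discrete_equilibrium(pmf):
    """pmf of W^e: draw W^s, then pick uniformly from {0, ..., W^s - 1}."""
    eq = [Fraction(0)] * (len(pmf) - 1)
    for w, q in enumerate(size_bias(pmf)):
        for k in range(w):
            eq[k] += q / w
    return eq

# Hypothetical example: W ~ Binomial(4, 1/2), so E[W] = 2 = (1 - p)/p with p = 1/3.
pmf_w = [Fraction(c, 16) for c in (1, 4, 6, 4, 1)]
p = Fraction(1, 3)
pmf_e = discrete_equilibrium(pmf_w)

f = lambda k: k * k + 5      # arbitrary test function; f(0) need not vanish in (6.1)
lhs = p * sum(q * f(k) for k, q in enumerate(pmf_w)) - p * f(0)
rhs = (1 - p) * sum(q * (f(k + 1) - f(k)) for k, q in enumerate(pmf_e))
print(pmf_e, lhs, rhs)
```

In this example the resulting point probabilities coincide with $\mathbb{P}(W > k)/\mathbb{E}[W]$, the renewal-theoretic equilibrium law referred to in Remark 6.2.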

Theorem 6.5. Let $N \sim \mathrm{Ge}^0(p)$ and $W \ge 0$ be an integer-valued random variable with $\mathbb{E}[W] =
(1-p)/p$ for some $0 < p \le 1$ such that $\mathbb{E}[W^2] < \infty$. If $W^e$ has the equilibrium distribution
with respect to $W$ and is coupled to $W$, then

\[ d_{TV}(W, N) \le 2(1-p)\mathbb{E}|W^e - W|. \]

Proof. If $f(0) = 0$ and $\|\Delta f\| \le 1$, then

\[ \left| \mathbb{E}[(1-p)\Delta f(W) - pf(W)] \right| = (1-p) \left| \mathbb{E}[\Delta f(W) - \Delta f(W^e)] \right| \le 2(1-p)\mathbb{E}|W^e - W|, \]

where the inequality follows after noting that $\Delta f(W) - \Delta f(W^e)$ can be written as a sum
of $|W^e - W|$ terms each of size at most $|\Delta f(W + i + 1) - \Delta f(W + i)| \le 2\|\Delta f\|$. Applying
Theorem 6.3 now proves the desired conclusion. □

6.3 Application: Uniform attachment graph model

Let $G_n$ be a directed random graph on $n$ nodes defined by the following recursive construction.
Initially the graph starts with one node with a single loop where one end of the loop
contributes to the "in-degree" and the other to the "out-degree." Now, for $2 \le m \le n$,
given the graph with $m - 1$ nodes, add node $m$ along with an edge directed from $m$ to a
node chosen uniformly at random among the $m$ nodes present. Note that this model allows
edges connecting a node with itself. This random graph model is referred to as uniform
attachment. We will prove the following geometric approximation result (convergence was
shown without rate in [15]), which is weaker than the result of [41] but has a slightly simpler
proof.

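Theorem 6.6. If $W$ is the in-degree of a node chosen uniformly at random from the random
graph $G_n$ generated according to uniform attachment and $N \sim \mathrm{Ge}^0(1/2)$, then

\[ d_{TV}(W, N) \le \frac{\log(n) + 1}{n}. \]

Proof. Let $X_i$ have a Bernoulli distribution, independent of all else, with parameter $\mu_i :=
(n - i + 1)^{-1}$, and let $N$ be an independent random variable that is uniform on the integers
$1, 2, \ldots, n$. If we imagine that node $n + 1 - N$ is the randomly selected node, then it is easy
to see that we can write $W := \sum_{i=1}^{N} X_i$.

Next, let us prove that $W^e := \sum_{i=1}^{N-1} X_i$ has the discrete equilibrium distribution w.r.t.
$W$. First note that we have for bounded $f$ and every $m$,

\[ \mu_m \mathbb{E}\Delta f\left( \sum_{i=1}^{m-1} X_i \right) = \mathbb{E}\left[ f\left( \sum_{i=1}^{m} X_i \right) - f\left( \sum_{i=1}^{m-1} X_i \right) \right], \]

where we use

\[ \mathbb{E}f(X_m) - f(0) = \mathbb{E}X_m \mathbb{E}\Delta f(0) \]

and thus the fact that we can write $X_m^e \equiv 0$. Note also that for any bounded function $g$
with $g(0) = 0$ we have

\[ \mathbb{E}\left[ \frac{g(N) - g(N-1)}{\mu_N} \right] = \mathbb{E}g(N). \]

We now assume that $f(0) = 0$. Hence, using the above two facts and independence between
$N$ and the sequence $X_1, X_2, \ldots$, we have

\[ \mathbb{E}W \mathbb{E}\Delta f(W^e) = \mathbb{E}f(W). \]

Since $W - W^e = X_N$, we have $\mathbb{E}|W^e - W| = \frac{1}{n} \sum_{i=1}^{n} (n - i + 1)^{-1} \le (\log(n) + 1)/n$ and the result follows
upon applying Theorem 6.5. □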
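The representation $W = \sum_{i=1}^{N} X_i$ from the proof makes the law of $W$ computable exactly, so the bound of Theorem 6.6 can be compared with the true total variation distance. The sketch below is an illustration only: the function names are hypothetical, and it takes $d_{TV}$ to be half the $\ell^1$ distance between the point probabilities.

```python
import math

def indegree_pmf(n):
    """Exact pmf of W = X_1 + ... + X_N with N uniform on {1, ..., n}
    and independent X_i ~ Bernoulli(1 / (n - i + 1))."""
    partial = [1.0]                      # pmf of the empty sum
    mixture = [0.0] * (n + 1)
    for i in range(1, n + 1):
        mu_i = 1.0 / (n - i + 1)
        nxt = [0.0] * (len(partial) + 1)
        for k, q in enumerate(partial):  # convolve with Bernoulli(mu_i)
            nxt[k] += q * (1 - mu_i)
            nxt[k + 1] += q * mu_i
        partial = nxt
        for k, q in enumerate(partial):  # N = i has probability 1/n
            mixture[k] += q / n
    return mixture

def tv_to_geometric_half(pmf):
    """Total variation distance to Ge^0(1/2), whose pmf is 2^-(k+1)."""
    inside = sum(abs(q - 0.5 ** (k + 1)) for k, q in enumerate(pmf))
    outside = 0.5 ** len(pmf)            # geometric mass beyond the support of pmf
    return 0.5 * (inside + outside)

for n in (2, 10, 100, 1000):
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    # Exact distance, the bound E|W^e - W| from the proof, and the stated bound.
    print(n, tv_to_geometric_half(indegree_pmf(n)), harmonic / n, (math.log(n) + 1) / n)
```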
7 Concentration Inequalities

The techniques we have developed for estimating expectations of characterizing operators
(e.g. exchangeable pairs) can also be used to prove concentration inequalities (or large
deviations). By concentration inequalities, we mean estimates of $\mathbb{P}(W \ge t)$ and $\mathbb{P}(W \le -t)$,
for $t > 0$ and some centered random variable $W$. Of course our previous work was concerned
with such estimates, but here we are after the rate at which these quantities tend to zero as $t$
tends to infinity, that is, in the tails of the distribution. Distributional error terms are maximized
in the body of the distribution and so typically do not provide information about the tails.

The study of concentration inequalities has a long history and has also found recent
use in machine learning and analysis of algorithms - see [16] and references therein for a
flavor of the modern considerations of these types of problems. Our results will hinge on the
following fundamental observation.

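Proposition 7.1. If $W$ is a random variable and there is a $\delta > 0$ such that $\mathbb{E}[e^{\theta W}] < \infty$ for
all $\theta \in (-\delta, \delta)$, then for all $t > 0$ and $0 < \theta < \delta$,

\[ \mathbb{P}(W \ge t) \le \frac{\mathbb{E}[e^{\theta W}]}{e^{\theta t}} \quad \text{and} \quad \mathbb{P}(W \le -t) \le \frac{\mathbb{E}[e^{-\theta W}]}{e^{\theta t}}. \]

Proof. Using first that $e^x$ is an increasing function, and then Markov's inequality,

\[ \mathbb{P}(W \ge t) = \mathbb{P}\left( e^{\theta W} \ge e^{\theta t} \right) \le \frac{\mathbb{E}[e^{\theta W}]}{e^{\theta t}}, \]

which proves the first assertion. The second assertion follows similarly. □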

Before discussing the use of Proposition 7.1 in Stein's method, we first work out a couple 
of easy examples. 

Example 7.1 (Normal distribution). Let $Z$ have the standard normal distribution and recall
that for $t > 0$ we have the Mills ratio bound

\[ \mathbb{P}(Z \ge t) \le \frac{e^{-t^2/2}}{t\sqrt{2\pi}}. \]

A simple calculation implies $\mathbb{E}[e^{\theta Z}] = e^{\theta^2/2}$ for all $\theta \in \mathbb{R}$, so that for $\theta, t > 0$ Proposition 7.1
implies

\[ \mathbb{P}(Z \ge t) \le e^{\theta^2/2 - \theta t}, \]

and choosing the minimizer $\theta = t$ yields

\[ \mathbb{P}(Z \ge t) \le e^{-t^2/2}, \]

which implies that this is the best behavior we can hope for using Proposition 7.1 in examples
where the random variable is approximately normal (such as sums of independent variables).
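To see how the bound of Example 7.1 compares with the exact tail and with the Mills ratio bound, one can tabulate all three. This is a minimal sketch with hypothetical function names; the exact tail is obtained from the complementary error function in Python's standard library.

```python
import math

def normal_tail(t):
    """Exact P(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def mills_bound(t):
    """Mills ratio bound exp(-t^2/2) / (t * sqrt(2 * pi)), valid for t > 0."""
    return math.exp(-t * t / 2) / (t * math.sqrt(2 * math.pi))

def chernoff_bound(t, theta=None):
    """Bound exp(theta^2/2 - theta * t) from Proposition 7.1; theta = t by default."""
    theta = t if theta is None else theta
    return math.exp(theta * theta / 2 - theta * t)

for t in (0.5, 1.0, 2.0, 4.0):
    print(t, normal_tail(t), mills_bound(t), chernoff_bound(t))
```

Both bounds share the factor $e^{-t^2/2}$; the Mills ratio bound is sharper by the polynomial factor $1/(t\sqrt{2\pi})$ once $t$ is moderately large.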
Example 7.2 (Poisson distribution). Let $Z$ have the Poisson distribution with mean $\lambda$.
A simple calculation implies $\mathbb{E}[e^{\theta Z}] = \exp\{\lambda(e^{\theta} - 1)\}$ for all $\theta \in \mathbb{R}$, so that for $\theta, t > 0$
Proposition 7.1 implies

\[ \mathbb{P}(Z - \lambda \ge t) \le \exp\{\lambda(e^{\theta} - 1) - \theta(t + \lambda)\}, \]

and choosing the minimizer $\theta = \log(1 + t/\lambda)$ yields

\[ \mathbb{P}(Z - \lambda \ge t) \le \exp\left\{ -t\left( \log\left(1 + \frac{t}{\lambda}\right) - 1 \right) - \lambda \log\left(1 + \frac{t}{\lambda}\right) \right\}, \]

which is of smaller order than $e^{-ct}$ for $t$ large and fixed $c > 0$, but of bigger order than
$e^{-t\log(t)}$. This is the best behavior we can hope for using Proposition 7.1 in examples where
the random variable is approximately Poisson (such as sums of independent indicators, each
with a small probability of being one).
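The Poisson bound of Example 7.2 can be compared with the exact tail in the same way. In the sketch below the function names, the mean $\lambda = 2$, and the number of summed terms are illustrative assumptions; the point probabilities are generated by a recursion to avoid overflow in large factorials.

```python
import math

def poisson_upper_tail(lam, x, n_terms=200):
    """P(Z >= x) for Z ~ Poisson(lam) and integer x >= 0, summing n_terms terms."""
    term = math.exp(-lam)
    for k in range(1, x + 1):
        term *= lam / k                  # after the loop, term = P(Z = x)
    total = 0.0
    for k in range(x, x + n_terms):
        total += term
        term *= lam / (k + 1)
    return total

def poisson_chernoff(lam, t):
    """Bound on P(Z - lam >= t) obtained with theta = log(1 + t / lam)."""
    log_ratio = math.log(1 + t / lam)
    return math.exp(-t * (log_ratio - 1) - lam * log_ratio)

lam = 2.0                                # illustrative mean (an assumption)
for t in (1, 3, 10, 50):
    print(t, poisson_upper_tail(lam, int(lam) + t), poisson_chernoff(lam, t))
```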

How does Proposition 7.1 help us use the techniques from Stein's method to obtain
concentration inequalities? If $W$ is a random variable and there is a $\delta > 0$ such that $\mathbb{E}[e^{\theta W}] <
\infty$ for all $\theta \in (-\delta, \delta)$, then we can define $m(\theta) = \mathbb{E}[e^{\theta W}]$ for $0 < \theta < \delta$, and we also have that
$m'(\theta) = \mathbb{E}[We^{\theta W}]$. Thus $m'(\theta)$ is of the form $\mathbb{E}[Wf(W)]$, where $f(W) = e^{\theta W}$, so that we can
use the techniques that we developed to bound the characterizing operator for the normal
and Poisson distribution to obtain a differential inequality for $m(\theta)$. Such an inequality will
lead to bounds on $m(\theta)$ so that we can apply Proposition 7.1 to obtain bounds on the tail
probabilities of $W$. This observation was first made in [17].

7.1 Concentration using exchangeable pairs 

Our first formulation using the couplings of Sections 3 and 4 for concentration inequalities 
uses exchangeable pairs. We follow the development of [18]. 

Theorem 7.2. Let $(W, W')$ be an $a$-Stein pair with $\mathrm{Var}(W) = \sigma^2 < \infty$. If $\mathbb{E}[e^{\theta W}|W' - W|] < \infty$
for all $\theta \in \mathbb{R}$ and for some sigma-algebra $\mathcal{F} \supseteq \sigma(W)$ there are non-negative constants $B$ and
$C$ such that

\[ \frac{\mathbb{E}[(W' - W)^2 | \mathcal{F}]}{2a} \le BW + C, \qquad (7.1) \]

then for all $t > 0$,

\[ \mathbb{P}(W \ge t) \le \exp\left\{ \frac{-t^2}{2C + 2Bt} \right\} \quad \text{and} \quad \mathbb{P}(W \le -t) \le \exp\left\{ \frac{-t^2}{2C} \right\}. \]

Remark 7.3. The reason the left tail has a better bound stems from condition (7.1) which
implies that $BW + C \ge 0$. Thus, the condition essentially forces the centered variable $W$
to be bounded from below whereas there is no such requirement for large positive values.
As can be understood from the proof of the theorem, conditions other than (7.1) may be
substituted to yield different bounds; see [19].

Before proving the theorem, we apply it in a simple example. 

Example 7.4 (Sum of independent variables). Let $X_1, \ldots, X_n$ be independent random variables
with $\mu_i := \mathbb{E}[X_i]$, $\sigma_i^2 := \mathrm{Var}(X_i) < \infty$ and define $W = \sum_i (X_i - \mu_i)$. Let $X_1', \ldots, X_n'$ be
an independent copy of the $X_i$ and for $I$ independent of all else and uniform on $\{1, \ldots, n\}$,
let $W' = W - X_I + X_I'$ so that as usual, $(W, W')$ is a $1/n$-Stein pair. We consider two special
cases of this setup.

1. For $i = 1, \ldots, n$, assume $|X_i - \mu_i| \le c_i$. Then clearly the moment generating function
condition of Theorem 7.2 is satisfied and we also have

\[ \mathbb{E}[(W' - W)^2 | (X_j)_{j \ge 1}] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[(X_i' - X_i)^2 | (X_j)_{j \ge 1}] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[(X_i' - \mu_i)^2] + \frac{1}{n} \sum_{i=1}^n (X_i - \mu_i)^2, \]

so that we can apply Theorem 7.2 with $B = 0$ and $2C = \sum_{i=1}^n (c_i^2 + \sigma_i^2)$. We have
shown that for $t > 0$,

\[ \mathbb{P}(|W - \mathbb{E}[W]| \ge t) \le 2\exp\left\{ -\frac{t^2}{\sum_{i=1}^n (c_i^2 + \sigma_i^2)} \right\}, \]

which is some version of Hoeffding's inequality [35].

2. For $i = 1, \ldots, n$, assume $0 \le X_i \le 1$. Then the moment generating function condition
of the theorem is satisfied and

\[ \mathbb{E}[(W' - W)^2 | (X_j)_{j \ge 1}] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[(X_i' - X_i)^2 | (X_j)_{j \ge 1}] = \frac{1}{n} \sum_{i=1}^n \left( \mathbb{E}[(X_i')^2] - 2\mu_i X_i + X_i^2 \right) \le \frac{1}{n} \sum_{i=1}^n (\mu_i + X_i) = \frac{1}{n}(2\mu + W), \]

where $\mu := \sum_{i=1}^n \mu_i$ is the mean of $\sum_i X_i$ and we have used that $\mathbb{E}[(X_i')^2] \le \mu_i$, $-2\mu_i X_i \le 0$ and $X_i^2 \le X_i$. We can now apply
Theorem 7.2 with $B = 1/2$ and $C = \mu$ to find that for $t > 0$,

\[ \mathbb{P}(|W| \ge t) \le 2\exp\left\{ -\frac{t^2}{2\mu + t} \right\}. \qquad (7.2) \]

Note that if $\mu$ is constant in $n$, then (7.2) is of the order $e^{-t}$ for large $t$, which according
to Example 7.2 is similar to the order of Poisson tails. However, if $\mu$ and $\sigma^2 := \mathrm{Var}(W)$
are both going to infinity at the same rate, then (7.2) implies

\[ \mathbb{P}\left( \frac{|W|}{\sigma} \ge t \right) \le 2\exp\left\{ -\frac{t^2}{2\mu/\sigma^2 + t/\sigma} \right\}, \]

so that for $\sigma^2$ large, the tails are of order $e^{-ct^2}$, which according to Example 7.1 is
similar to the order of Gaussian tails.
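Both cases of Example 7.4 apply to a sum of i.i.d. Bernoulli variables, where the exact tail is available. The sketch below is illustrative only: the parameters $n = 50$ and success probability $0.1$, the function names, and the choice $c_i = \max(p, 1-p)$ are assumptions.

```python
import math

def binomial_pmf(n, p):
    """Point probabilities of the sum of n i.i.d. Bernoulli(p) variables."""
    return [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def two_sided_tail(pmf, mean, t):
    """Exact P(|sum - mean| >= t)."""
    return sum(q for k, q in enumerate(pmf) if abs(k - mean) >= t)

def bounded_case_bound(t, c, var):
    """Case 1: 2 * exp(-t^2 / sum_i (c_i^2 + sigma_i^2))."""
    return 2 * math.exp(-t * t / sum(ci * ci + vi for ci, vi in zip(c, var)))

def unit_interval_bound(t, mu):
    """Case 2, inequality (7.2): 2 * exp(-t^2 / (2 * mu + t))."""
    return 2 * math.exp(-t * t / (2 * mu + t))

n, p = 50, 0.1                           # illustrative parameters (assumptions)
pmf = binomial_pmf(n, p)
mu = n * p
c = [max(p, 1 - p)] * n                  # |X_i - p| <= max(p, 1 - p)
var = [p * (1 - p)] * n
for t in (2, 5, 10, 20):
    print(t, two_sided_tail(pmf, mu, t), bounded_case_bound(t, c, var),
          unit_interval_bound(t, mu))
```

With a small success probability the mean is much smaller than $n$, and (7.2) is noticeably sharper than the bound from the first case.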

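Proof of Theorem 7.2. Let $m(\theta) = \mathbb{E}[e^{\theta W}]$ and note that $m'(\theta) = \mathbb{E}[We^{\theta W}]$. Since $(W, W')$
is an $a$-Stein pair, we can use (3.17) to find that for all $f$ such that $\mathbb{E}|Wf(W)| < \infty$,

\[ \mathbb{E}[(W' - W)(f(W') - f(W))] = 2a\mathbb{E}[Wf(W)]. \]

In particular,

\[ m'(\theta) = \frac{\mathbb{E}[(W' - W)(e^{\theta W'} - e^{\theta W})]}{2a}, \qquad (7.3) \]

and in order to bound this term, we use the convexity of the exponential function to obtain
for $x > y$

\[ \frac{e^x - e^y}{x - y} = \int_0^1 \exp\{tx + (1-t)y\}\,dt \le \int_0^1 \left[ te^x + (1-t)e^y \right] dt = \frac{e^x + e^y}{2}. \qquad (7.4) \]

Combining (7.3) and (7.4), we find that for all $\theta \in \mathbb{R}$,

\[ |m'(\theta)| \le |\theta| \frac{\mathbb{E}[(W' - W)^2 (e^{\theta W'} + e^{\theta W})]}{4a} = |\theta| \frac{\mathbb{E}[(W' - W)^2 e^{\theta W}]}{2a} \le |\theta| \mathbb{E}[(BW + C)e^{\theta W}] \le B|\theta| m'(\theta) + C|\theta| m(\theta), \qquad (7.5) \]

where the equality is by exchangeability and the penultimate inequality follows from the
hypothesis (7.1). Now, since $m$ is convex and $m'(0) = 0$, we find that $m'(\theta)/\theta > 0$ for $\theta \ne 0$.
We now break the proof into two cases, corresponding to the positive and negative tails of
the distribution of $W$.

$\theta > 0$. In this case, our calculation above implies that for $0 < \theta < 1/B$,

\[ \frac{d}{d\theta} \log(m(\theta)) = \frac{m'(\theta)}{m(\theta)} \le \frac{C\theta}{1 - B\theta}, \]

which yields that

\[ \log(m(\theta)) \le \int_0^{\theta} \frac{Cu}{1 - Bu}\,du \le \frac{C\theta^2}{2(1 - B\theta)}, \]

and from this point we easily find

\[ m(\theta) \le \exp\left\{ \frac{C\theta^2}{2(1 - B\theta)} \right\}. \]

According to Proposition 7.1 we now have for $t > 0$ and $0 < \theta < 1/B$,

\[ \mathbb{P}(W \ge t) \le \exp\left\{ \frac{C\theta^2}{2(1 - B\theta)} - \theta t \right\}, \]

and choosing $\theta = t/(C + Bt)$ proves the first assertion of the theorem.

$\theta < 0$. In this case, since $m'(\theta) < 0$, (7.5) is bounded above by $-C\theta m(\theta)$ which implies

\[ C\theta \le \frac{d}{d\theta} \log(m(\theta)) < 0. \]

From this equation, some minor consideration shows that

\[ \log(m(\theta)) \le \frac{C\theta^2}{2}. \]

According to Proposition 7.1 we now have for $t > 0$ and $\theta < 0$,

\[ \mathbb{P}(W \le -t) \le \exp\left\{ \frac{C\theta^2}{2} + \theta t \right\}, \]

and choosing $\theta = -t/C$ proves the second assertion of the theorem. □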



Example 7.5 (Hoeffding's combinatorial CLT). Let $(a_{ij})_{1 \le i,j \le n}$ be an array of real numbers
and let $\sigma$ be a uniformly chosen random permutation of $\{1, \ldots, n\}$. Let

\[ W = \sum_{i=1}^n a_{i\sigma_i} - \frac{1}{n} \sum_{i,j} a_{ij}, \]

and define $\sigma' = \sigma\tau$, where $\tau$ is a uniformly chosen transposition and

\[ W' = \sum_{i=1}^n a_{i\sigma_i'} - \frac{1}{n} \sum_{i,j} a_{ij}. \]

It is a straightforward exercise to show that $\mathbb{E}[W] = 0$ and $(W, W')$ is a $2/(n-1)$-Stein
pair so that it may be possible to apply Theorem 7.2. In fact, we have the following result.



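Proposition 7.3. If $W$ is defined as above with $0 \le a_{ij} \le 1$, then for all $t > 0$,

\[ \mathbb{P}(|W| \ge t) \le 2\exp\left\{ \frac{-t^2}{\frac{4}{n} \sum_{i,j} a_{ij} + 2t} \right\}. \]

Proof. Let $(W, W')$ be the $2/(n-1)$-Stein pair as defined in the remarks preceding the
statement of the proposition. We now have

\[ \mathbb{E}[(W' - W)^2 | \sigma] = \frac{1}{n(n-1)} \sum_{i,j} \left( a_{i\sigma_i} + a_{j\sigma_j} - a_{i\sigma_j} - a_{j\sigma_i} \right)^2 \]
\[ \le \frac{2}{n(n-1)} \sum_{i,j} \left( a_{i\sigma_i} + a_{j\sigma_j} + a_{i\sigma_j} + a_{j\sigma_i} \right) = \frac{4}{n-1} W + \frac{8}{n(n-1)} \sum_{i,j} a_{ij}, \]

so that we can apply Theorem 7.2 with $B = 1$ and $C = (2/n) \sum_{i,j} a_{ij}$ to prove the result. □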
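The two facts used for Hoeffding's combinatorial statistic, the linearity condition $\mathbb{E}[W' - W | \sigma] = -\frac{2}{n-1} W$ and the bound on $\mathbb{E}[(W' - W)^2 | \sigma]$, can be checked by enumerating all transpositions for a fixed permutation. The array, the seed, the size $n = 6$, and the function names below are hypothetical choices for illustration.

```python
import itertools
import random

def centered_stat(a, sigma):
    """W = sum_i a[i][sigma[i]] - (1/n) * sum_{i,j} a[i][j]."""
    n = len(a)
    return sum(a[i][sigma[i]] for i in range(n)) - sum(map(sum, a)) / n

def transposition_moments(a, sigma):
    """Return W, E[W' - W | sigma] and E[(W' - W)^2 | sigma], where sigma' is
    sigma composed with a uniformly chosen transposition."""
    n = len(a)
    w = centered_stat(a, sigma)
    pairs = list(itertools.combinations(range(n), 2))
    first = second = 0.0
    for i, j in pairs:
        swapped = list(sigma)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        diff = centered_stat(a, swapped) - w
        first += diff / len(pairs)
        second += diff * diff / len(pairs)
    return w, first, second

rng = random.Random(0)                   # fixed seed, illustrative only
n = 6
a = [[rng.random() for _ in range(n)] for _ in range(n)]   # entries in [0, 1]
sigma = list(range(n))
rng.shuffle(sigma)

w, first, second = transposition_moments(a, sigma)
total = sum(map(sum, a))
print(first, -2 * w / (n - 1))                                  # Stein pair condition
print(second, 4 * w / (n - 1) + 8 * total / (n * (n - 1)))      # second-moment bound
```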



7.2 Application: Magnetization in the Curie-Weiss model

Let $\beta > 0$, $h \in \mathbb{R}$ and for $\sigma \in \{-1, 1\}^n$ define the Gibbs measure

\[ \mathbb{P}(\sigma) = Z^{-1} \exp\left\{ \frac{\beta}{n} \sum_{i<j} \sigma_i \sigma_j + \beta h \sum_{i} \sigma_i \right\}, \qquad (7.6) \]

where $Z$ is the appropriate normalizing constant (the so-called "partition function" of
statistical physics).

We think of $\sigma$ as a configuration of "spins" ($\pm 1$) on a system with $n$ sites. The spin of
a site depends on those at all other sites since for all $i \ne j$, each of the terms $\sigma_i \sigma_j$ appears
in the first sum. Thus, the most likely configurations are those in which many of the spins
take the same value ($+1$ if $h > 0$ and $-1$ if $h < 0$). This probability model is referred to as the
Curie-Weiss model and a quantity of interest is $m = \frac{1}{n} \sum_{i=1}^n \sigma_i$, the "magnetization" of the
system. We will show the following result found in [18].
Proposition 7.4. If $m = \frac{1}{n} \sum_{i=1}^n \sigma_i$, then for all $\beta > 0$, $h \in \mathbb{R}$, and $t \ge 0$,

\[ \mathbb{P}\left( |m - \tanh(\beta m + \beta h)| \ge \frac{\beta}{n} + \frac{t}{\sqrt{n}} \right) \le 2\exp\left\{ -\frac{t^2}{4(1 + \beta)} \right\}, \]

where $\tanh(x) := (e^x - e^{-x})/(e^x + e^{-x})$.

In order to prove the proposition, we need a more general result than that of Theorem
7.2; the proofs of the two results are very similar.

Theorem 7.5. Let $(X, X')$ be an exchangeable pair of random elements on a Polish space. Let
$F$ be an antisymmetric function and define

\[ f(X) := \mathbb{E}[F(X, X') | X]. \]

If $\mathbb{E}[e^{\theta f(X)}|F(X, X')|] < \infty$ for all $\theta \in \mathbb{R}$ and there are constants $B, C \ge 0$ such that

\[ \frac{1}{2} \mathbb{E}\left[ |(f(X) - f(X'))F(X, X')| \,\big|\, X \right] \le Bf(X) + C, \]

then for all $t > 0$,

\[ \mathbb{P}(f(X) \ge t) \le \exp\left\{ \frac{-t^2}{2C + 2Bt} \right\} \quad \text{and} \quad \mathbb{P}(f(X) \le -t) \le \exp\left\{ \frac{-t^2}{2C} \right\}. \]

In order to recover Theorem 7.2 from the result above, if $(W, W')$ is an $a$-Stein pair, then
we can take $F(W, W') = (W - W')/a$, so that $f(W) := \mathbb{E}[F(W, W') | W] = W$.

Proof of Proposition 7.4. In the notation of Theorem 7.5, we will set $X = \sigma$ and $X' = \sigma'$,
where $\sigma$ is chosen according to the Gibbs measure given by (7.6) and $\sigma'$ is a step from $\sigma$ in
the following reversible Markov chain: at each step of the chain a site from the $n$ possible
sites is chosen uniformly at random and then the spin at that site is resampled according to
the Gibbs measure (7.6) conditional on the value of the spins at all other sites. This chain
is most commonly known as the Gibbs sampler. We will take $F(\sigma, \sigma') = \sum_{i=1}^n (\sigma_i - \sigma_i')$ so
that we will use Theorem 7.5 to study $f(\sigma) = \mathbb{E}[F(\sigma, \sigma') | \sigma]$.

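The first thing we need to do is compute the transition probabilities for the Gibbs sampler
chain. Suppose the chosen site is site $i$, then

\[ \mathbb{P}(\sigma_i' = 1 | (\sigma_j)_{j \ne i}) = \frac{\mathbb{P}(\sigma_i = 1, (\sigma_j)_{j \ne i})}{\mathbb{P}((\sigma_j)_{j \ne i})}. \]

Note that $\mathbb{P}((\sigma_j)_{j \ne i}) = \mathbb{P}(\sigma_i = 1, (\sigma_j)_{j \ne i}) + \mathbb{P}(\sigma_i = -1, (\sigma_j)_{j \ne i})$, and that

\[ \mathbb{P}(\sigma_i = \pm 1, (\sigma_j)_{j \ne i}) = \frac{1}{Z} \exp\left\{ \frac{\beta}{n} \sum_{j<k;\, j,k \ne i} \sigma_j \sigma_k + \beta h \sum_{j \ne i} \sigma_j \right\} \exp\left\{ \pm\left( \frac{\beta}{n} \sum_{j \ne i} \sigma_j + \beta h \right) \right\}. \]

Thus

\[ \mathbb{P}(\sigma_i' = 1 | (\sigma_j)_{j \ne i}) = \frac{\exp\left\{ \frac{\beta}{n} \sum_{j \ne i} \sigma_j + \beta h \right\}}{\exp\left\{ \frac{\beta}{n} \sum_{j \ne i} \sigma_j + \beta h \right\} + \exp\left\{ -\frac{\beta}{n} \sum_{j \ne i} \sigma_j - \beta h \right\}}, \]

\[ \mathbb{P}(\sigma_i' = -1 | (\sigma_j)_{j \ne i}) = \frac{\exp\left\{ -\frac{\beta}{n} \sum_{j \ne i} \sigma_j - \beta h \right\}}{\exp\left\{ \frac{\beta}{n} \sum_{j \ne i} \sigma_j + \beta h \right\} + \exp\left\{ -\frac{\beta}{n} \sum_{j \ne i} \sigma_j - \beta h \right\}}, \]

and hence

\[ \mathbb{E}[F(\sigma, \sigma') | \sigma] = \frac{1}{n} \sum_{i=1}^n \sigma_i - \frac{1}{n} \sum_{i=1}^n \tanh\left( \frac{\beta}{n} \sum_{j \ne i} \sigma_j + \beta h \right), \qquad (7.7) \]

where the summation over $i$ and the factor of $1/n$ is due to the fact that the resampled site
is chosen uniformly at random (note also that for $j \ne i$, $\mathbb{E}[\sigma_j - \sigma_j' | \sigma \text{ and chose site } i] = 0$).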

We will give a concentration inequality for (7.7) using Theorem 7.5, and then show that 
the difference between (7.7) and the quantity of interest in the proposition is almost surely 
bounded by a small quantity, which will prove the result. 

If we denote $m_i := \frac{1}{n} \sum_{j \ne i} \sigma_j$, then

\[ f(\sigma) := \mathbb{E}[F(\sigma, \sigma') | \sigma] = m - \frac{1}{n} \sum_{i=1}^n \tanh\{\beta m_i + \beta h\}, \]

and we need to check the conditions of Theorem 7.5 for $f(\sigma)$. The condition involving the
moment generating function is obvious since all quantities involved are finite, so we only
need to find constants $B, C \ge 0$ such that

\[ \frac{1}{2} \mathbb{E}\left[ |(f(\sigma) - f(\sigma'))F(\sigma, \sigma')| \,\big|\, \sigma \right] \le Bf(\sigma) + C. \qquad (7.8) \]

Since $F(\sigma, \sigma')$ is the difference of the sum of the spins in one step of the Gibbs sampler and
only one spin can change in a step of the chain, we have $|F(\sigma, \sigma')| \le 2$.
Also, if we denote $m' := \frac{1}{n} \sum_{i=1}^n \sigma_i'$ (and $m_i'$ analogously), then using that $|\tanh(x) - \tanh(y)| \le |x - y|$ (which
essentially follows from the inequality (7.4): $2|e^x - e^y| \le |x - y|(e^x + e^y)$), we find

\[ |f(\sigma) - f(\sigma')| \le |m - m'| + \frac{\beta}{n} \sum_{i=1}^n |m_i - m_i'| \le \frac{2(1 + \beta)}{n}. \]

Hence, (7.8) is satisfied with $B = 0$ and $C = 2(1 + \beta)/n$, and Theorem 7.5 now yields

\[ \mathbb{P}\left( \left| m - \frac{1}{n} \sum_{i=1}^n \tanh(\beta m_i + \beta h) \right| \ge \frac{t}{\sqrt{n}} \right) \le 2\exp\left\{ -\frac{t^2}{4(1 + \beta)} \right\}. \]

To complete the proof we note that

\[ \left| \frac{1}{n} \sum_{i=1}^n \left[ \tanh(\beta m_i + \beta h) - \tanh(\beta m + \beta h) \right] \right| \le \frac{1}{n} \sum_{i=1}^n |\beta m_i - \beta m| \le \frac{\beta}{n}, \]

and thus an application of the triangle inequality yields the bound in the proposition. □
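For moderate $n$ the law of the magnetization under (7.6) can be computed exactly, because $\sum_{i<j} \sigma_i \sigma_j = (S^2 - n)/2$ with $S = \sum_i \sigma_i$, so the Gibbs weight depends on $\sigma$ only through $S$. The following sketch uses this observation to compare the exact probability in Proposition 7.4 with the stated bound; the function names and the parameter grid are illustrative assumptions.

```python
import math

def magnetization_law(n, beta, h):
    """Exact law of m = S/n under (7.6); returns a list of (m, probability)."""
    log_weights = []
    for k in range(n + 1):               # k = number of spins equal to -1
        s = n - 2 * k
        energy = (beta / n) * (s * s - n) / 2 + beta * h * s
        log_weights.append((s / n, math.log(math.comb(n, k)) + energy))
    shift = max(lw for _, lw in log_weights)          # avoid overflow in exp
    z = sum(math.exp(lw - shift) for _, lw in log_weights)
    return [(m, math.exp(lw - shift) / z) for m, lw in log_weights]

def deviation_probability(n, beta, h, t):
    """P(|m - tanh(beta*m + beta*h)| >= beta/n + t/sqrt(n)), computed exactly."""
    threshold = beta / n + t / math.sqrt(n)
    return sum(q for m, q in magnetization_law(n, beta, h)
               if abs(m - math.tanh(beta * m + beta * h)) >= threshold)

def proposition_bound(beta, t):
    """Right-hand side of Proposition 7.4."""
    return 2 * math.exp(-t * t / (4 * (1 + beta)))

for beta in (0.5, 1.0, 2.0):
    for t in (1.0, 3.0, 5.0):
        print(beta, t, deviation_probability(100, beta, 0.0, t), proposition_bound(beta, t))
```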



7.3 Concentration using size-bias couplings 



As previously mentioned, the key step in the proof of Theorem 7.2 was to rewrite $m'(\theta) :=
\mathbb{E}[We^{\theta W}]$ using exchangeable pairs in order to get a differential inequality for $m(\theta)$. We can
follow this same program, but with a size-bias coupling in place of the exchangeable pair.
We follow the development of [29].

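Theorem 7.6. Let $X \ge 0$ with $\mathbb{E}[X] = \mu$ and $0 < \mathrm{Var}(X) = \sigma^2 < \infty$ and let $X^s$ be a
size-biased coupling of $X$ such that $|X - X^s| \le C < \infty$.

1. If $X^s \ge X$, then

\[ \mathbb{P}\left( \frac{X - \mu}{\sigma} \le -t \right) \le \exp\left\{ -\frac{t^2}{2\left( \frac{\mu C}{\sigma^2} \right)} \right\}. \]

2. If $m(\theta) = \mathbb{E}[e^{\theta X}] < \infty$ for $\theta = 2/C$, then

\[ \mathbb{P}\left( \frac{X - \mu}{\sigma} \ge t \right) \le \exp\left\{ -\frac{t^2}{2\left( \frac{\mu C}{\sigma^2} + \frac{Ct}{2\sigma} \right)} \right\}. \]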

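Proof. To prove the first item, let $\theta \le 0$ so that $m(\theta) := \mathbb{E}[e^{\theta X}] < \infty$ since $X \ge 0$. As in
the proof of Theorem 7.2, we will need the inequality (7.4): for all $x, y \in \mathbb{R}$,

\[ \left| \frac{e^x - e^y}{x - y} \right| \le \frac{e^x + e^y}{2}. \]

Using this fact and that $X^s \ge X$, we find

\[ \mathbb{E}[e^{\theta X} - e^{\theta X^s}] \le \frac{C|\theta|}{2} \left( \mathbb{E}[e^{\theta X}] + \mathbb{E}[e^{\theta X^s}] \right) \le C|\theta| \mathbb{E}[e^{\theta X}]. \qquad (7.9) \]

The definition of the size-bias distribution implies that $m'(\theta) = \mu\mathbb{E}[e^{\theta X^s}]$ so that (7.9)
yields the differential inequality $m'(\theta) \ge \mu(1 + C\theta)m(\theta)$, or put otherwise

\[ \frac{d}{d\theta} \left[ \log(m(\theta)) - \mu\theta \right] \ge \mu C\theta. \qquad (7.10) \]

Setting $\tilde{m}(\theta) = \log(m(\theta)) - \mu\theta$, (7.10) implies $\tilde{m}(\theta) \le \mu C\theta^2/2$, and it follows that

\[ \mathbb{E}\left[ \exp\left\{ \theta \frac{X - \mu}{\sigma} \right\} \right] = \exp\left\{ \tilde{m}\left( \frac{\theta}{\sigma} \right) \right\} \le \exp\left\{ \frac{\mu C\theta^2}{2\sigma^2} \right\}. \]

We can now apply Proposition 7.1 to find that

\[ \mathbb{P}\left( \frac{X - \mu}{\sigma} \le -t \right) \le \exp\left\{ \frac{\mu C\theta^2}{2\sigma^2} + \theta t \right\}. \qquad (7.11) \]

The right hand side of (7.11) is minimized at $\theta = -\sigma^2 t/(\mu C)$, and substituting this value
into (7.11) yields the first item of the theorem.

For the second assertion of the theorem, suppose that $0 \le \theta < 2/C$. A calculation similar
to (7.9) above shows that

\[ \frac{m'(\theta)}{\mu} - m(\theta) \le \frac{C\theta}{2} \left( \frac{m'(\theta)}{\mu} + m(\theta) \right), \]

so that we can write

\[ m'(\theta) \le \mu \frac{1 + \frac{C\theta}{2}}{1 - \frac{C\theta}{2}} m(\theta). \]

Again defining $\tilde{m}(\theta) = \log(m(\theta)) - \mu\theta$, we have $\tilde{m}'(\theta) \le C\mu\theta/(1 - \frac{C\theta}{2})$ so that

\[ \tilde{m}\left( \frac{\theta}{\sigma} \right) \le \frac{\mu C\theta^2}{\sigma^2 \left( 2 - \frac{C\theta}{\sigma} \right)} \quad \text{for } 0 \le \theta < \min\{2/C, 2\sigma/C\}. \]

We can now apply Proposition 7.1 to find that

\[ \mathbb{P}\left( \frac{X - \mu}{\sigma} \ge t \right) \le \exp\left\{ \frac{\mu C\theta^2}{\sigma^2 \left( 2 - \frac{C\theta}{\sigma} \right)} - \theta t \right\}. \qquad (7.12) \]

Choosing $\theta = t\left( \frac{\mu C}{\sigma^2} + \frac{Ct}{2\sigma} \right)^{-1}$ in (7.12), as in the proof of Theorem 7.2,
yields the second item of the theorem. □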
Theorem 7.6 can be applied in many of the examples we have discussed in the context
of the size-bias transform for normal and Poisson approximation and others. We content 
ourselves with a short example and refer to [28] for many more applications. 

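Example 7.6. Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathrm{Be}(p)$, fix $k \ge 1$, and define $X_i = \prod_{j=i}^{i+k-1} Y_j$ with the
bounds being modular. Further define $X = \sum_{i=1}^n X_i$ and define $X^s$ by sampling $X_1, \ldots, X_n$
and forcing $X_1 = 1$. Conditional on this, the rest of the $Y_i$ are as before. If

\[ X_i^{(k)} = \begin{cases} 1 & \text{if there exists a head run of length } k \text{ at } i \text{ after forcing,} \\ 0 & \text{otherwise,} \end{cases} \]

then $X^s = 1 + \sum_{i=2}^n X_i^{(k)}$. Note that $X^s \ge X$ and $|X^s - X| \le 2k - 1$. In this case Theorem
7.6 implies

\[ \mathbb{P}\left( \frac{|X - \mu|}{\sigma} \ge t \right) \le 2\exp\left\{ -\frac{t^2}{2(2k-1)\left( \frac{\mu}{\sigma^2} + \frac{t}{2\sigma} \right)} \right\}, \]

where $\mu = np^k$ and $\sigma^2 = \mu\left( 1 - p^k + 2\sum_{i=1}^{k-1} (p^i - p^k) \right)$.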
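In the simplest case $k = 1$ of Example 7.6, $X$ is binomial, the coupling constant is $C = 2k - 1 = 1$, and both tail bounds of Theorem 7.6 can be compared with exact binomial tails. The sketch below is an illustration only; the parameters $n = 40$ and success probability $0.3$ and the function names are assumptions.

```python
import math

def head_run_moments(n, p, k):
    """mu = n * p^k and sigma^2 = mu * (1 - p^k + 2 * sum_{i=1}^{k-1} (p^i - p^k))."""
    mu = n * p ** k
    var = mu * (1 - p ** k + 2 * sum(p ** i - p ** k for i in range(1, k)))
    return mu, var

def lower_tail_bound(t, mu, var, c):
    """Item 1 of Theorem 7.6: exp(-t^2 / (2 * c * mu / sigma^2))."""
    return math.exp(-t * t / (2 * c * mu / var))

def upper_tail_bound(t, mu, var, c):
    """Item 2 of Theorem 7.6: exp(-t^2 / (2 * (c*mu/sigma^2 + c*t/(2*sigma))))."""
    return math.exp(-t * t / (2 * (c * mu / var + c * t / (2 * math.sqrt(var)))))

def binomial_tails(n, p, t):
    """Exact P((X - mu)/sigma <= -t) and P((X - mu)/sigma >= t) for X ~ Binomial(n, p)."""
    mu, var = head_run_moments(n, p, 1)
    sigma = math.sqrt(var)
    pmf = [math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]
    lower = sum(q for j, q in enumerate(pmf) if (j - mu) / sigma <= -t)
    upper = sum(q for j, q in enumerate(pmf) if (j - mu) / sigma >= t)
    return lower, upper

n, p, k = 40, 0.3, 1                     # illustrative parameters (assumptions)
mu, var = head_run_moments(n, p, k)
for t in (0.5, 1.0, 2.0, 3.0):
    lower, upper = binomial_tails(n, p, t)
    print(t, lower, lower_tail_bound(t, mu, var, 2 * k - 1),
          upper, upper_tail_bound(t, mu, var, 2 * k - 1))
```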



Acknowledgments 

The author thanks the students who sat in on the course on which this document is based. Special
thanks go to those students who contributed notes, parts of which may appear in some form
above: Josh Abramson, Miklos Racz, Douglas Rizzolo, and Rachel Wang.



References 

[1] D. Aldous and J. Fill. Reversible Markov chains and random walks on graphs.
http://www.stat.berkeley.edu/~aldous/RWG/book.html, 2010.

[2] R. Arratia, A. D. Barbour, and S. Tavaré. Logarithmic combinatorial structures: a probabilistic
approach. EMS Monographs in Mathematics. European Mathematical Society
(EMS), Zürich, 2003.

[3] R. Arratia and L. Goldstein. Size bias, sampling, the waiting time paradox, and infinite
divisibility: when is the increment independent? http://arxiv.org/abs/1007.3910,
2011.

[4] R. Arratia, L. Goldstein, and L. Gordon. Two moments suffice for Poisson approximations:
the Chen-Stein method. Ann. Probab., 17(1):9-25, 1989.






[5] R. Arratia, L. Goldstein, and L. Gordon. Poisson approximation and the Chen-Stein
method. Statist. Sci., 5(4):403-434, 1990. With comments and a rejoinder by the
authors.

[6] R. Arratia, L. Gordon, and M. S. Waterman. The Erdős-Rényi law in distribution, for
coin tossing and sequence matching. Ann. Statist., 18(2):539-570, 1990.

[7] K. B. Athreya and P. E. Ney. Branching processes. Springer-Verlag, New York, 1972.
Die Grundlehren der mathematischen Wissenschaften, Band 196.

[8] F. Avram and D. Bertsimas. On central limit theorems in geometrical probability. Ann.
Appl. Probab., 3(4):1033-1046, 1993.

[9] P. Baldi, Y. Rinott, and C. Stein. A normal approximation for the number of local
maxima of a random function on a graph. In Probability, statistics, and mathematics,
pages 59-81. Academic Press, Boston, MA, 1989.

[10] A. D. Barbour and L. H. Y. Chen, editors. An introduction to Stein's method, volume 4
of Lecture Notes Series. Institute for Mathematical Sciences. National University of
Singapore. Singapore University Press, Singapore, 2005. Lectures from the Meeting
on Stein's Method and Applications: a Program in Honor of Charles Stein held at the
National University of Singapore, Singapore, July 28-August 31, 2003.

[11] A. D. Barbour, L. Holst, and S. Janson. Poisson approximation, volume 2 of Oxford
Studies in Probability. The Clarendon Press Oxford University Press, New York, 1992.
Oxford Science Publications.

[12] A. D. Barbour, M. Karoński, and A. Ruciński. A central limit theorem for decomposable
random variables with applications to random graphs. J. Combin. Theory Ser. B,
47(2):125-145, 1989.

[13] A. C. Berry. The accuracy of the Gaussian approximation to the sum of independent
variates. Trans. Amer. Math. Soc., 49:122-136, 1941.

[14] B. Bollobás. Random graphs, volume 73 of Cambridge Studies in Advanced Mathematics.
Cambridge University Press, Cambridge, second edition, 2001.

[15] B. Bollobás, O. Riordan, J. Spencer, and G. Tusnády. The degree sequence of a scale-free
random graph process. Random Structures Algorithms, 18(3):279-290, 2001.

[16] O. Bousquet, S. Boucheron, and G. Lugosi. Concentration inequalities. In Advanced
Lectures on Machine Learning: ML Summer Schools 2003, Lecture Notes in Artificial
Intelligence, pages 208-240. Springer, 2004.






[17] S. Chatterjee. Concentration inequalities with exchangeable pairs.
http://arxiv.org/abs/math/0507526, 2005. Ph.D. dissertation, Stanford University.

[18] S. Chatterjee. Stein's method for concentration inequalities. Probab. Theory Related
Fields, 138(1-2):305-321, 2007.

[19] S. Chatterjee and P. S. Dey. Applications of Stein's method for concentration inequalities.
Ann. Probab., 38(6):2443-2485, 2010.

[20] S. Chatterjee, P. Diaconis, and E. Meckes. Exchangeable pairs and Poisson approximation.
Probab. Surv., 2:64-106 (electronic), 2005.

[21] S. Chatterjee, J. Fulman, and A. Röllin. Exponential approximation by Stein's method
and spectral graph theory. ALEA Lat. Am. J. Probab. Math. Stat., 8:197-223, 2011.

[22] S. Chatterjee and Q.-M. Shao. Nonnormal approximation by Stein's method of exchangeable
pairs with application to the Curie-Weiss model. Ann. Appl. Probab.,
21(2):464-483, 2011.

[23] S. Chatterjee and Students. Stein's method course notes.
http://www.stat.berkeley.edu/~sourav/stat206Afall07.html, 2007.

[24] L. H. Y. Chen, L. Goldstein, and Q.-M. Shao. Normal approximation by Stein's method.
Probability and its Applications (New York). Springer, Heidelberg, 2011.

[25] L. H. Y. Chen and Q.-M. Shao. Normal approximation under local dependence. Ann.
Probab., 32(3A):1985-2028, 2004.

[26] P. Diaconis and S. Holmes, editors. Stein's method: expository lectures and applications.
Institute of Mathematical Statistics Lecture Notes - Monograph Series, 46. Institute of
Mathematical Statistics, Beachwood, OH, 2004. Papers from the Workshop on Stein's
Method held at Stanford University, Stanford, CA, 1998.

[27] P. Donnelly and D. Welsh. The antivoter problem: random 2-colourings of graphs. In
Graph theory and combinatorics (Cambridge, 1983), pages 133-144. Academic Press,
London, 1984.

[28] S. Ghosh and L. Goldstein. Applications of size biased couplings for concentration of
measures. Electronic Communications in Probability, 16:70-83, 2011.

[29] S. Ghosh and L. Goldstein. Concentration of measures via size-biased couplings. Probability
Theory and Related Fields, 149:271-278, 2011. 10.1007/s00440-009-0253-3.






[30] L. Goldstein. A probabilistic proof of the Lindeberg-Feller central limit theorem. Amer.
Math. Monthly, 116(1):45-60, 2009.

[31] L. Goldstein. A Berry-Esseen bound with applications to counts in the Erdős-Rényi
random graph. http://arxiv.org/abs/1005.4390, 2010.

[32] L. Goldstein and G. Reinert. Stein's method and the zero bias transformation with
application to simple random sampling. Ann. Appl. Probab., 7(4):935-952, 1997.

[33] L. Goldstein and Y. Rinott. Multivariate normal approximations by Stein's method and
size bias couplings. J. Appl. Probab., 33(1):1-17, 1996.

[34] G. R. Grimmett and D. R. Stirzaker. Probability and random processes. Oxford University
Press, New York, third edition, 2001.

[35] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer.
Statist. Assoc., 58:13-30, 1963.

[36] S. Janson, T. Łuczak, and A. Ruciński. Random graphs. Wiley-Interscience Series in
Discrete Mathematics and Optimization. Wiley-Interscience, New York, 2000.

[37] T. M. Liggett. Interacting particle systems. Classics in Mathematics. Springer-Verlag,
Berlin, 2005. Reprint of the 1985 original.

[38] R. Lyons, R. Pemantle, and Y. Peres. Conceptual proofs of L log L criteria for mean
behavior of branching processes. Ann. Probab., 23(3):1125-1138, 1995.

[39] E. Peköz. Stein's method for geometric approximation. J. Appl. Probab., 33(3):707-713,
1996.

[40] E. Peköz and A. Röllin. New rates for exponential approximation and the theorems of
Rényi and Yaglom. Annals of Probability, 2010. To appear.

[41] E. Peköz, A. Röllin, and N. Ross. Total variation and local limit error bounds for
geometric approximation. http://arxiv.org/abs/1005.2774, 2010.

[42] J. Pitman. Probabilistic bounds on the coefficients of polynomials with only real zeros.
J. Combin. Theory Ser. A, 77(2):279-303, 1997.

[43] G. Reinert. Three general approaches to Stein's method. In An introduction to Stein's
method, volume 4 of Lect. Notes Ser. Inst. Math. Sci. Natl. Univ. Singap., pages 183-221.
Singapore Univ. Press, Singapore, 2005.

[44] Y. Rinott and V. Rotar. On coupling constructions and rates in the CLT for dependent
summands with applications to the antivoter model and weighted U-statistics. Ann.
Appl. Probab., 7(4):1080-1105, 1997.






[45] A. Röllin. A note on the exchangeability condition in Stein's method.
http://arxiv.org/abs/math/0611050v1, 2006.

[46] A. Röllin. Stein couplings for normal approximation.
http://arxiv.org/abs/1003.6039, 2010.

[47] A. Röllin and N. Ross. A probabilistic approach to local limit theorems with applications
to random graphs. http://arxiv.org/abs/1011.3100, 2010.

[48] S. Ross and E. Peköz. A second course in probability. www.ProbabilityBookstore.com,
Boston, 2007.

[49] A. Ruciński. When are small subgraphs of a random graph normally distributed?
Probab. Theory Related Fields, 78(1):1-10, 1988.

[50] Q.-M. Shao and Z.-G. Su. The Berry-Esseen bound for character ratios. Proc. Amer.
Math. Soc., 134(7):2153-2159 (electronic), 2006.

[51] C. Stein. A bound for the error in the normal approximation to the distribution of a
sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on
Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971),
Vol. II: Probability theory, pages 583-602, Berkeley, Calif., 1972. Univ. California Press.

[52] C. Stein. Approximate computation of expectations. Institute of Mathematical Statistics
Lecture Notes - Monograph Series, 7. Institute of Mathematical Statistics, Hayward,
CA, 1986.

[53] M. Waterman. Introduction to computational biology. Chapman & Hall/CRC Interdisciplinary
Statistics. CRC Press, 1995.






